CN113938746B - Network live broadcast audio processing method and device, equipment, medium and product thereof - Google Patents


Info

Publication number
CN113938746B
CN113938746B · Application CN202111144000.5A
Authority
CN
China
Prior art keywords
audio data
signal
local
far
loop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111144000.5A
Other languages
Chinese (zh)
Other versions
CN113938746A
Inventor
何鑫 (He Xin)
苏嘉昌 (Su Jiachang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huaduo Network Technology Co Ltd
Original Assignee
Guangzhou Huaduo Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huaduo Network Technology Co Ltd
Priority to CN202111144000.5A
Publication of CN113938746A
Application granted
Publication of CN113938746B
Legal status: Active
Anticipated expiration


Classifications

    • H ELECTRICITY › H04 ELECTRIC COMMUNICATION TECHNIQUE › H04N PICTORIAL COMMUNICATION, e.g. TELEVISION › H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/439 Processing of audio elementary streams (under H04N 21/40 client devices › H04N 21/43 processing of content or additional data)
    • H04N 21/2187 Live feed (under H04N 21/20 content servers › H04N 21/218 source of audio or video content)
    • G PHYSICS › G10 MUSICAL INSTRUMENTS; ACOUSTICS › G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING › G10L 21/00 Speech or voice signal processing techniques to modify quality or intelligibility › G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation › G10L 21/0208 Noise filtering
    • G10L 21/0224 Noise filtering characterised by the noise-estimation method; processing in the time domain
    • G10L 21/0232 Noise filtering characterised by the noise-estimation method; processing in the frequency domain
    • G10L 21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation, zero-crossing or predictive techniques
    • G10L 2021/02082 Noise filtering where the noise is echo or reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present application relates to network live broadcast audio processing, and discloses a network live broadcast audio processing method together with a corresponding apparatus, device, medium and program product. The method comprises the following steps: acquiring far-end audio data in the live room connection state, and mixing it with local audio data to obtain playout audio data; performing echo cancellation on the real-time audio data captured by the local device, using the far-end audio data as the reference signal, to obtain intermediate audio data in which the echo of the far-end audio data is cancelled while the echo of the local audio data is retained; delaying the local audio data by the local loopback delay value and mixing it with the intermediate audio data to obtain mixed audio data; and pushing a live stream containing the mixed audio data to the live room. The application effectively eliminates the double talk caused by multi-party connections during network live broadcast, preserves call quality, and keeps the sounds of multiple sources synchronously aligned in the time domain.

Description

Network live broadcast audio processing method and device, equipment, medium and product thereof
Technical Field
The present application relates to network live broadcast audio processing, and in particular to a network live broadcast audio processing method together with a corresponding apparatus, computer device, computer-readable storage medium, and computer program product.
Background
Echo cancellation is widely applied in voice communication: the transfer function of the echo path is estimated with an adaptive filtering algorithm, and the reference signal is filtered through this transfer function so as to remove the echo. When both communicating parties talk simultaneously (double talk), the linear relation between the reference signal and the captured signal is broken, which disturbs the convergence of the adaptive filter coefficients; the echo path estimated by adaptive filtering is then distorted, and the speech signal after echo cancellation suffers from dropped words, stuttering, echo leakage and the like.
In a network live broadcast scene there are multiple sound sources. The anchor plays background music locally and pushes it to the audience; if the anchor is additionally connected to an audience member, the anchor also pulls the sound of that connected audience member. The processing flow of this scene is shown in fig. 1, where the dashed box "pull stream" marks a step that is executed only while a connection is active.
In the prior art, as shown in fig. 2, the echo cancellation uses the locally played sound as the reference signal. Because a music signal is continuous, the locally played sound is present all the time, so double talk arises as soon as the anchor speaks, regardless of whether audio from a connected audience member is being pulled. Owing to this frequent double talk, the anchor's voice after echo cancellation exhibits dropped words, echo leakage and the like.
Fig. 3 shows measured results of echo cancellation with the prior art. From top to bottom the waveforms are the anchor signal, the music signal, the captured signal and the echo-cancelled signal. The captured signal is a mixture of the anchor's speech picked up by the terminal device and the echo of the locally played music reflected by the room; the echo-cancelled signal is obtained by cancelling echo from the captured signal using the played music as the reference signal. The "captured signal" makes evident that whenever the anchor talks and produces the "anchor signal", the capture is in the double-talk state. Comparing the "anchor signal" with the "echo-cancelled signal" shows significant signal loss, most notably dropped words that make the sound stutter. Leaked echo also appears in the "echo-cancelled signal", mainly around the 4-second position in the time domain.
In view of this, a new technical solution is needed to overcome the dropped words, stuttering and echo leakage caused by frequent double talk in the scene where music is played locally and pushed out during network live broadcast.
Disclosure of Invention
It is a primary object of the present application to solve at least one of the above problems and to provide a network live broadcast audio processing method and a corresponding apparatus, computer device, computer-readable storage medium and computer program product.
In order to meet the purposes of the application, the application adopts the following technical scheme:
The present application provides a network live broadcast audio processing method serving one of its objects, comprising the following steps:
acquiring far-end audio data in the live room connection state, and mixing it with local audio data to obtain playout audio data;
performing echo cancellation on the real-time audio data captured by the local device, using the far-end audio data as the reference signal, to obtain intermediate audio data in which the echo of the far-end audio data is cancelled and the echo of the local audio data is retained;
delaying the local audio data by the local loopback delay value and mixing it with the intermediate audio data to obtain mixed audio data;
and pushing a live stream containing the mixed audio data to the live room.
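The data flow of the steps above can be sketched in code. The following minimal Python sketch is illustrative only: all names are hypothetical, and `cancel_echo` is a stub standing in for the adaptive echo filter of the later embodiments. It shows only the order of mixing, cancellation and delay-compensated remixing, not the patent's actual implementation.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def mix(a, b):
    """Element-wise sum of two sample lists, zero-padded and clipped to int16."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [max(INT16_MIN, min(INT16_MAX, x + y)) for x, y in zip(a, b)]

def cancel_echo(mic, reference):
    """Stub for the adaptive echo canceller: only the far-end reference is
    cancelled, so any echo of the local music survives in the output."""
    return mix(mic, [-s for s in reference])  # placeholder, not a real AEC

def process_frame(far_end, local, mic, loopback_delay):
    # Step 1: mix far-end and local audio to obtain the playout data.
    playout = mix(far_end, local)
    # Step 2: cancel only the far-end echo from the captured signal.
    intermediate = cancel_echo(mic, far_end)
    # Step 3: delay the local audio by the loopback delay, then remix.
    delayed_local = [0] * loopback_delay + local
    mixed = mix(delayed_local, intermediate)
    # Step 4: the mixed frame would be pushed to the live room here.
    return playout, mixed
```

The key structural point is that `local` never enters `cancel_echo` as a reference, which is what keeps the local music out of the double-talk path.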
In a further embodiment, acquiring the far-end audio data in the live room connection state and playing it after mixing with the local audio data comprises the following steps:
acquiring the far-end live stream pushed by the server in the live room connection state;
extracting the far-end audio data from the far-end live stream;
mixing the far-end audio data with the local audio data to obtain playout audio data;
and converting the playout audio data into an audible sound signal for playback.
In a further embodiment, performing echo cancellation on the real-time audio data captured by the local device, using the far-end audio data as the reference signal, to obtain the intermediate audio data comprises the following steps:
continuously capturing the externally input sound signal from the local sound card to obtain the real-time audio data;
applying a preset adaptive echo filtering algorithm with the far-end audio data as the reference signal, and performing echo cancellation on the real-time audio data to cancel the echo corresponding to the far-end audio data;
and retaining, as the intermediate audio data, the real-time audio data in which the echo corresponding to the local audio data is preserved.
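The text does not name the adaptive algorithm, so as one concrete (and purely assumed) instance of a "preset adaptive echo filtering algorithm", the sketch below uses a normalized LMS (NLMS) filter, a standard choice for acoustic echo cancellation. Only the far-end data is passed as the reference, so the echo of the local music in the microphone signal is untouched. Function names and parameter values are illustrative.

```python
def nlms_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Cancel the echo of `reference` from `mic` with an NLMS adaptive filter.

    reference: far-end samples used as the reference signal
    mic:       samples captured by the local sound card
    Returns the error signal, i.e. mic with the reference echo removed.
    """
    w = [0.0] * taps                       # adaptive estimate of the echo path
    out = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples (zeros before the start).
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, x))    # estimated far-end echo
        e = mic[n] - y                              # echo-cancelled sample
        norm = sum(xk * xk for xk in x) + eps       # power normalisation
        w = [wk + (mu / norm) * e * xk for wk, xk in zip(w, x)]
        out.append(e)
    return out
```

Because the local music never enters `reference`, its echo is preserved in the output exactly as this embodiment requires; only the component correlated with the far-end reference converges toward zero.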
In a specific embodiment, delaying the local audio data by the local loopback delay value and mixing it with the intermediate audio data to obtain the mixed audio data comprises the following steps:
acquiring the loopback delay value of the local device;
and mixing the local audio data with the intermediate audio data, with the local audio data lagging by the loopback delay value, to obtain the mixed audio data.
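The lagged mixing can be sketched as below. This is an assumed minimal form: the local data is shifted by the loopback delay (in samples) so that it lines up with its own room echo, which the intermediate audio data still carries, before the two are summed.

```python
def mix_with_lag(local, intermediate, loopback_delay):
    """Mix `local` into `intermediate` with `local` lagging by
    `loopback_delay` samples, so the direct local data and its
    room echo inside `intermediate` align in the time domain."""
    delayed = [0.0] * loopback_delay + list(local)
    n = max(len(delayed), len(intermediate))
    delayed += [0.0] * (n - len(delayed))
    padded = list(intermediate) + [0.0] * (n - len(intermediate))
    return [a + b for a, b in zip(delayed, padded)]
```

With a two-sample delay, a local burst lands exactly on top of its echo: the two copies reinforce instead of smearing across time.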
In a further embodiment, acquiring the loopback delay value of the local device comprises the following steps:
injecting a preset loopback identification signal into the playout audio data at a first time instant, the loopback identification signal being a high-frequency signal above the audible band of the human ear and consisting of a plurality of single-frequency tones equally spaced in the frequency domain;
detecting whether the loopback identification signal is present in the real-time audio data, and recording the second time instant at which it is detected;
determining the loopback delay value as the difference between the first and second time instants;
and storing the loopback delay value for direct reuse later.
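These steps can be sketched under assumed parameters (the text does not specify the sample rate, tone frequencies or amplitudes; 48 kHz and four tones near 19–20 kHz are illustrative choices). The marker is a sum of equally spaced single-frequency tones above the audible band, added into the playout buffer at a known first instant; the delay is then just the sample distance between injection and detection.

```python
import math

SAMPLE_RATE = 48000                        # assumed device rate
TONE_FREQS = [18500, 19000, 19500, 20000]  # equally spaced, above hearing

def make_marker(num_samples, amplitude=0.05):
    """Loopback identification signal: a sum of near-inaudible single tones."""
    return [amplitude * sum(math.sin(2 * math.pi * f * n / SAMPLE_RATE)
                            for f in TONE_FREQS)
            for n in range(num_samples)]

def inject_marker(playout, marker, first_instant):
    """Add the marker into the playout buffer starting at `first_instant`."""
    out = list(playout)
    for i, m in enumerate(marker):
        out[first_instant + i] += m
    return out

def loopback_delay_samples(first_instant, second_instant):
    """Loopback delay = detection instant minus injection instant."""
    return second_instant - first_instant
```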
In a further embodiment, detecting whether the loopback identification signal is present in the real-time audio data comprises the following steps:
tracking the noise of the real-time audio data along the time domain and transforming it into the frequency domain to obtain the corresponding noise energy spectrum data;
locating the spectral peaks at the relevant frequency bins from the speech energy spectrum data mapped from the speech frames of the real-time audio data;
computing, for each speech frame, the probability that the loopback identification signal is present from the speech energy and noise energy at each frequency bin;
and deciding that the loopback identification signal is detected when the presence probability over several consecutive speech frames satisfies a preset condition.
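A simplified stand-in for this detection can be sketched with a per-frame tone-energy test: the energy at each marker frequency is measured with the Goertzel algorithm, a fixed threshold stands in for the tracked noise floor, and detection is confirmed only after several consecutive frames pass, mirroring the consecutive-frame condition above. All parameter values are assumptions; the patent's actual probability computation is not reproduced.

```python
import math

def goertzel_power(frame, freq, sample_rate):
    """Energy of `frame` at a single frequency (Goertzel algorithm)."""
    w = 2.0 * math.pi * freq / sample_rate
    coeff = 2.0 * math.cos(w)
    s1 = s2 = 0.0
    for x in frame:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def detect_marker(frames, freqs, sample_rate, threshold, consecutive=3):
    """Return the index of the frame at which detection is confirmed,
    i.e. the `consecutive`-th frame in a row where every marker tone's
    energy exceeds the threshold; -1 if never confirmed."""
    run = 0
    for i, frame in enumerate(frames):
        if min(goertzel_power(frame, f, sample_rate) for f in freqs) > threshold:
            run += 1
            if run >= consecutive:
                return i
        else:
            run = 0
    return -1
```

Requiring every tone to exceed the threshold (the `min` over frequencies) makes a false trigger from ordinary wideband noise far less likely than testing a single tone.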
In an extended embodiment, the step of performing echo cancellation on the real-time audio data captured by the local device with the far-end audio data as the reference signal, so as to cancel the echo corresponding to the far-end audio data and obtain the intermediate audio data, is preceded by the following step:
detecting the state of the echo-cancellation switch set by the user logged into the live room on the local device, and performing the subsequent steps only when the switch is active.
A network live broadcast audio processing apparatus serving one of the objects of the present application comprises:
a stream-playing module, configured to acquire the far-end audio data in the live room connection state, mix it with the local audio data to obtain playout audio data, and play the playout audio data;
an echo cancellation module, configured to perform echo cancellation on the real-time audio data captured by the local device with the far-end audio data as the reference signal, to obtain intermediate audio data in which the echo of the far-end audio data is cancelled and the echo of the local audio data is retained;
a loopback correction module, configured to delay the local audio data by the local loopback delay value and mix it with the intermediate audio data to obtain mixed audio data;
and a live pushing module, configured to push a live stream containing the mixed audio data to the live room.
A computer device provided in accordance with one of the objects of the present application comprises a central processor and a memory, the central processor being configured to invoke and run a computer program stored in the memory so as to perform the steps of the network live broadcast audio processing method of the present application.
A computer-readable storage medium adapted to another object of the present application stores, in the form of computer-readable instructions, a computer program implementing the network live broadcast audio processing method; when invoked and run by a computer, it performs the steps comprised by the method.
A computer program product adapted to another object of the present application comprises computer programs/instructions which, when executed by a processor, implement the steps of the network live broadcast audio processing method described in any embodiment of the present application.
Compared with the prior art, the application has the following advantages:
First, in the network live broadcast connection state, echo cancellation is performed on the real-time audio data captured by the local device with the local audio data and the far-end audio data decoupled in the echo cancellation stage: only the far-end audio data is taken as the reference signal, so that the echo produced by playing out the far-end audio data on the local device is cancelled from the real-time audio data while the echo produced by playing the local audio data is retained, yielding the intermediate audio data. Consequently, the echo of the far-end audio data cannot interfere with the local audio data in the subsequent mixing stage and cause double talk, and the audio stream pushed to the live room, mixed from the intermediate audio data and the local audio data, does not suffer from dropped words, stuttering or leaked echo of the local sources.
Second, the present application uses the loopback delay value of the local device to apply delay compensation to the local audio data, so that the local audio data is mixed with the intermediate audio data at the lag specified by the loopback delay value and the resulting mixed audio data is aligned in the time domain.
In addition, the present application applies to scenarios with an online-call character, including but not limited to network video live broadcast, karaoke and the like. In these scenarios, based on the audio stream produced by the processing of the present application, listeners hear the multiple sound sources played synchronously with smooth sound; from the listener's perspective, the singer, the lyrics in the video and the background music are perceived to remain in sync, thereby improving user experience.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a generalized schematic block diagram of a technical architecture for processing audio data at a terminal device, which is also applicable in various embodiments of the present application;
fig. 2 is a flow chart of the prior-art process of echo cancellation of real-time audio data at a terminal device, in which both the music echo and the audience echo are cancelled;
fig. 3 shows measured waveforms obtained by processing the speech signals in the network live connection state with the prior art; from top to bottom: anchor signal, music signal, captured signal and echo-cancelled signal;
FIG. 4 is a flow chart of an exemplary embodiment of a method for processing live audio according to the present application;
fig. 5 is a flow chart of obtaining a far-end live stream and playing it mixed with the local audio data according to an embodiment of the present application;
FIG. 6 is a flow chart of echo cancellation for real-time audio data according to an embodiment of the present application;
FIG. 7 is a flow chart illustrating the delay compensation according to the loopback delay value according to the embodiment of the present application;
FIG. 8 is a flowchart illustrating a process of acquiring a loopback delay value by itself according to an embodiment of the present application;
fig. 9 illustrates the basic principle of computing the loopback delay value of an audio signal, where T1 to T4 denote different time instants, the capture thread is responsible for capturing audio data, the playing thread is responsible for converting and playing audio data, and the anchor denotes the external sound source from which the real-time audio data is captured;
FIG. 10 is a flow chart of a process for detecting a loop-back identification signal by tracking noise signals in real-time audio data according to an embodiment of the present application;
fig. 11 is a flow chart of the process of echo cancellation of real-time audio data at a terminal device according to the present application, in which only the audience echo is cancelled while the music echo is retained;
FIG. 12 is a functional block diagram of using the loopback delay value to apply delay compensation to the played music of the local audio data;
fig. 13 shows, from top to bottom, measured waveforms before and after the time-delay compensation of the present technical scheme in the same scene;
fig. 14 shows measured waveforms obtained by processing the relevant speech signals with the present technical scheme while the anchor user is not connected to other users; from top to bottom: anchor signal, music signal, echo-cancelled signal before the improvement and echo-cancelled signal after the improvement;
fig. 15 shows measured waveforms obtained by processing the relevant speech signals with the present technical scheme while the anchor user is connected to other users; from top to bottom: anchor signal, audience signal, music signal, echo-cancelled signal before the improvement and echo-cancelled signal after the improvement;
FIG. 16 is a functional block diagram of an exemplary embodiment of a network live audio processing device of the present application;
fig. 17 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As used herein, "client" and "terminal device" are understood by those skilled in the art to include both devices that have only a wireless signal receiver with no transmitting capability and devices with receiving and transmitting hardware capable of two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device with or without a multi-line display, such as a personal computer or a tablet; a PCS (Personal Communications Service) terminal that may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant) that may include a radio-frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; or a conventional laptop and/or palmtop computer or other appliance that has and/or includes a radio-frequency receiver. As used herein, a "client" or "terminal device" may be portable, transportable, installed in a vehicle (aeronautical, maritime and/or land-based), or adapted and/or configured to operate locally and/or in a distributed fashion at any other location on earth and/or in space. A "client" or "terminal device" may also be a communication terminal, an Internet terminal or a music/video playing terminal, for example a PDA, a MID (Mobile Internet Device) and/or a mobile phone with a music/video playing function, or a device such as a smart TV or a set-top box.
The hardware referred to in the present application, such as servers, clients and service nodes, is essentially electronic equipment with the capabilities of a personal computer: hardware devices comprising the components required by the von Neumann architecture, such as a central processing unit (including an arithmetic unit and a controller), memory, input devices and output devices. A computer program is stored in the memory; the central processing unit loads and runs the program, executes its instructions and interacts with the input and output devices, thereby accomplishing specific functions.
It should be noted that what the present application calls a "server" applies equally to a server cluster. According to network deployment principles understood by those skilled in the art, the servers should be logically partitioned: physically separate from each other yet callable through interfaces, or integrated into one physical computer or one group of computers. Those skilled in the art will appreciate this variation, which should not be construed as limiting the network deployment of the present application.
Unless stated otherwise in plain text, one or more technical features of the present application may be deployed on a server, with the client remotely invoking an online service interface provided by the server for access, or may be deployed and run directly on the client.
Unless stated otherwise in plain text, the various data involved in the present application may be stored either remotely on a server or on a local terminal device, as long as it can be invoked by the technical solution of the present application.
Those skilled in the art will appreciate that although the various methods of the present application are described based on the same concept so that they share common content, the methods may be performed independently of each other unless specifically indicated otherwise. Similarly, the embodiments disclosed herein all rest on the same general inventive concept; therefore concepts expressed identically, as well as concepts that are appropriately adapted for convenience though expressed differently, should be interpreted as equivalents.
Unless a mutually exclusive relation is stated in plain text, the technical features of the various embodiments disclosed herein may be cross-combined to construct new embodiments flexibly, provided that such combination does not depart from the inventive spirit of the present application and satisfies the needs of the art or remedies deficiencies in the prior art. Those skilled in the art will be aware of this variant.
The network live broadcast audio processing method of the present application can be applied to terminal devices both in an offline environment and in server-supported instant-communication scenarios, including but not limited to instant messaging, network video live broadcast, online customer service, karaoke and other exemplary application scenes; as a basic audio processing technology it has a wide application range. The method may be programmed as a computer program product deployed to run on a terminal device.
Referring to fig. 4, and referring to fig. 1, in an exemplary embodiment of the network live audio processing method of the present application, the method includes the following steps:
Step S1100: acquiring far-end audio data in the live room connection state, mixing it with local audio data to obtain playout audio data, and playing the playout audio data.
In a live room of a network video live broadcast, a connection can be started between two or more users. The users may be anchor users, i.e. the owners of two live room instances; the anchor user of one live room instance may also establish a connection with audience users of that live room; or one anchor user may start a multi-user connection with other anchor users and/or audience users of the live room, entering a multi-user connection state. For the anchor user, the peer's live stream is pulled over the data communication link established with each peer and played on the local terminal device; meanwhile, the anchor's own live stream is pushed through the server to each peer, and is synchronously pushed to the terminal devices of the other audience users of the live room for playback. Thus, for the anchor user, any live stream of another user obtained via the data communication link through the server's pushing is a far-end live stream, while the live stream pushed by the anchor is the near-end live stream.
A live stream generally comprises a video stream and an audio stream; the two streaming-media paths can be separated and extracted at the terminal device for decoding and output respectively. Correspondingly, the audio data extracted from the far-end live stream constitutes the far-end audio data, and the audio data in the locally generated near-end live stream is the near-end audio data.

The far-end audio data is relatively simple: it is produced by the far-end user's own terminal device, which processes and mixes the real-time audio data it collects, so that on arrival at the anchor user's local terminal device it can be regarded as a single sound source. When multiple far-end users are connected there are, in theory, multiple far-end sound sources; in practice, however, during the echo cancellation performed later in the present application, the multiple far-end users can be regarded as one aggregated far-end sound source for centralized echo cancellation, which is not repeated here.

The near-end audio data is generated by the anchor user's terminal device by mixing audio data produced by multiple sound sources. For example, when the local terminal device is playing background music, a local playback source is formed; the background music may be a local music file or audio data pulled from a remote media server, and in either case the local terminal device produces audio data corresponding to the local playback source, i.e. the local audio data. In addition, if an input device such as the microphone of the local terminal device is active, the sound card of the local terminal device collects the externally input speech signal to generate corresponding audio data, namely the real-time audio data continuously acquired by the local terminal device. It can thus be seen that when the anchor user plays music and turns on the voice input device, the local terminal device obtains audio data corresponding to these two sound sources; if a far-end user is currently in the connected state, that user may additionally be regarded as at least one far-end sound source.
The local terminal device processes the different sound sources differently. Referring to fig. 1, it enables two threads, a playing thread and a collection thread, respectively responsible for the output and the collection of sound signals.

In the playing thread, the local terminal device calls back the audio data of the local playback source, mixes it with the far-end audio data to obtain the play-out audio data, converts the play-out audio data into a speech signal and outputs it. It can be understood that, when all the sound sources coexist, the played-out sound comprises at least the background sound corresponding to the audio data of the local playback source and the sound corresponding to the far-end audio data of the far-end sound source. The process of converting the mixed audio data into a speech signal and outputting it is within the ordinary skill of those in the art and is not described in detail.
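As an illustration of the playing thread's mixing step, the following is a minimal sketch, assuming 16-bit PCM frames of equal length; the function name `mix_playout` and the sample values are hypothetical, not taken from the present application:

```python
import numpy as np

def mix_playout(local_frame: np.ndarray, far_frame: np.ndarray) -> np.ndarray:
    """Mix one frame of local playback-source audio with one frame of
    far-end audio into the play-out frame, saturating to the 16-bit range."""
    mixed = local_frame.astype(np.int32) + far_frame.astype(np.int32)
    return np.clip(mixed, -32768, 32767).astype(np.int16)

# Hypothetical frames: the third sample would overflow int16 and saturates.
local = np.array([1000, -2000, 30000], dtype=np.int16)
far = np.array([500, -500, 10000], dtype=np.int16)
print(mix_playout(local, far).tolist())  # [1500, -2500, 32767]
```

Widening to 32-bit before summing avoids wrap-around overflow; real mixers may instead apply gain scaling, but saturation keeps the sketch self-contained.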
In the collection thread, the local terminal device continuously obtains from the sound card the real-time audio data converted from the externally input speech signal. When the play-out audio data is played through a loudspeaker such as a sound box, the real-time audio data may in theory encapsulate the various signals contained in the speech signal, including the speaker's voice signal, the echo signal looped back after the local audio data is played out, the echo signal looped back after the far-end audio data is played out, and so on. Following conventional practice, the real-time audio data is subjected to echo cancellation and then mixed with the local audio data to obtain mixed audio data, which is contained in the live stream as the near-end audio stream and pushed to each far-end user; in the exemplary live broadcasting room scenario, unless configured otherwise, the far-end users include both the connected users in a connection state with the current anchor user and the other audience users in the live broadcasting room.

The above process of the present application is described mainly from the side of the anchor user in the live broadcasting room. It can be understood that the process is equally applicable to the other far-end users in the connection state: both sides may be arranged symmetrically and apply the technical solutions of the embodiments of the present application, as those skilled in the art will appreciate.
Step S1200, performing echo cancellation on the real-time audio data collected by the local device, using the far-end audio data as a reference signal, to obtain intermediate audio data in which the echo signal of the far-end audio data is cancelled while the echo signal of the local audio data is retained:

In the echo cancellation link of the collection-thread process shown in fig. 1, the present application processes the real-time audio data differently from the prior art. Specifically, an adaptive filtering algorithm is applied that performs echo cancellation on the real-time audio data with the far-end audio data as the reference signal, so that the echo signal generated when the far-end audio data loops back after being played out by the local terminal device is filtered out of the real-time audio data, whereas the echo signal generated by the loop-back of the local audio data is retained, because the local audio data is not referenced.
The basic principle of this echo cancellation is that an adaptive filter performs parameter identification of the unknown echo channel: based on the correlation between the loudspeaker signal and the echoes it generates, an audio signal model of the echo-causing audio data is established to simulate the echo path, the adaptive algorithm adjusts the filter so that its impulse response approximates the real echo path, and the resulting echo estimate is then subtracted from the speech signal received by the microphone, i.e. from the real-time audio data, thereby realizing echo cancellation. It follows that the chosen reference signal has a decisive influence on the correlation exploited when modeling the audio signal to simulate the echo path. In particular, if multiple echo signals are present in the real-time audio data, which of them are cancelled depends on the reference signal provided. In the present application the far-end audio data is taken as the reference signal, so that ultimately only the echo signal caused by the far-end audio data is eliminated. On this principle, almost all existing adaptive filtering algorithms can be adapted to eliminate the echo signal of the far-end audio data from the real-time audio data.
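As one concrete instance of such an adaptive filtering algorithm, the following sketch uses a standard normalized LMS (NLMS) filter, chosen here purely for illustration and not asserted to be the algorithm of the present application, with the far-end audio data as the reference; the tap count, step size, and toy echo path are hypothetical:

```python
import numpy as np

def nlms_echo_cancel(mic, far_ref, taps=64, mu=0.5, eps=1e-8):
    """Sample-by-sample NLMS adaptive filter: model the echo path of the
    far-end reference and subtract the estimated echo from the microphone
    signal. Anything uncorrelated with the reference (local speech, the
    local audio's own echo) passes through in the residual."""
    w = np.zeros(taps)        # adaptive FIR estimate of the echo path
    buf = np.zeros(taps)      # most recent reference samples
    out = np.empty(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_ref[n]
        echo_est = w @ buf            # estimated far-end echo
        e = mic[n] - echo_est         # residual after cancellation
        w += mu * e * buf / (buf @ buf + eps)  # normalized update
        out[n] = e
    return out

# Toy check: the microphone hears only an echo of the reference through a
# short hypothetical echo path [0.5, 0.3]; the residual decays toward zero.
rng = np.random.default_rng(0)
far = rng.standard_normal(4000)
mic = 0.5 * far + 0.3 * np.concatenate(([0.0], far[:-1]))
res = nlms_echo_cancel(mic, far)
print(np.mean(res[-500:] ** 2) < 1e-3 * np.mean(mic ** 2))  # True
```

Because only `far_ref` enters the update, any locally played signal is invisible to the filter and its echo survives in `out`, which is exactly the decoupling behavior described above.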
The adaptive filtering algorithm can be any of various algorithms known in the prior art; it is only necessary to constrain the corresponding reference signal to the speech signal of the far-end audio data.

Incidentally, in 2007 Valin proposed a scheme for adjusting the learning rate in frequency-domain echo cancellation to optimize for double-talk:

Valin, Jean-Marc. "On Adjusting the Learning Rate in Frequency Domain Echo Cancellation With Double-Talk." IEEE Transactions on Audio, Speech & Language Processing 15.3 (2007): 1030-1034.
In the above solution, Valin uses a dynamically calculated learning factor to address double-talk. Specifically, when double-talk occurs, the correlation between the estimated echo signal and the error signal after adaptive filtering is small, so the adaptive learning factor is small and the adaptive filter coefficients are updated slowly. This avoids divergence of the filter coefficients under double-talk and cancels echo well once the adaptive filter coefficients have converged, but behaves poorly while the coefficients have not yet converged. For example, in the live broadcast scenario addressed by the present application, where locally played accompaniment produces the speech signal of the local playback source and the anchor's singing produces the speaker's voice signal, double-talk is essentially continuous; with unconverged filter coefficients updated slowly, dropped words, stuttering, echo leakage and similar problems can occur.

After the technical solution of the present application is applied, with the far-end audio data as the reference signal, the filter converges quickly in the same scenario, the double-talk situation is greatly improved, and dropped words, stuttering, echo leakage and the like are avoided.
In some embodiments, when there are multiple streams of far-end audio data corresponding to multiple connected users, all the far-end audio data may be treated as a single stream of far-end audio data.

After the echo cancellation processing of this step, the intermediate audio data is obtained, in which the echo signal generated by playing out the local audio data is deliberately retained. This echo signal is preserved into the mixing link of the collection thread, where it is superimposed with the delay-compensated local audio data, thereby protecting the sound quality of the speech signal of the local audio data.
Step S1300, mixing the local audio data, after superimposing the local loop-back delay value, with the intermediate audio data to obtain mixed audio data:

Referring to fig. 1, in this step the local audio data may be delay-compensated using a previously obtained loop-back delay value of the local terminal device, so that the local audio data is mixed with the intermediate audio data in the time domain according to the loop-back delay value. This achieves synchronous time-domain alignment between the speech signal encapsulated in the local audio data and the echo signal encapsulated in the intermediate audio data, avoids as far as possible harmful interference with the local audio data from its own echo signal, and protects the sound quality of the speech signal of the local audio data. After mixing, the mixed audio data is obtained, encapsulating the speech signal corresponding to the local audio data, the speaker's voice signal (if any), and the echo signal generated after the local audio data is played out, all three now aligned in the time domain.

The loop-back delay value may be intrinsic data calculated in advance for the local terminal device and called directly, or it may be calculated by the method of the present application. For the latter, corresponding embodiments are given in subsequent examples to further enrich the inventive aspects of the present application.
Step S1400, pushing a live stream containing the mixed audio data to the live broadcasting room:

When the mixed audio data has been generated, it can serve as the audio stream in the live stream according to the processing logic of a terminal in the live broadcasting room and be pushed both to the connected users in the connection state and to the audience users in the live broadcasting room. After receiving the live stream, each such user extracts the audio stream from it for playback, and each sound source is perceived as synchronized in the time domain.

In a further exemplary application scenario in a network live broadcasting room, the terminal device of the anchor user is playing a song: background music including accompaniment is played while its lyrics are displayed in the video stream, and the anchor user sings into the microphone, following the lyrics. The audio data corresponding to the background music is the local audio data, which is not only played locally but also transmitted to the server supporting the live broadcasting room, to be pushed to the connected users and the audience users in the room for listening. During this process, a connected user occasionally speaks, generating corresponding far-end audio data, which is mixed with the local audio data into the play-out audio data for playback.

The live stream pushed to the live broadcasting room comprises the mixed audio data generated by the collection thread of the local terminal device, transmitted as the audio stream in the live stream. Because the echo signal corresponding to the far-end audio data has been removed from this audio data, while the speech signal of the delay-compensated local audio data is protected by the echo signal caused by the local audio data, the audio stream of the live stream overcomes the double-talk problem. When the audio stream is played by a far-end user, the speech signals of all sound sources are synchronized; in the singing scenario, the anchor user's voice stays synchronized with the lyric subtitles in the video stream, and even when other connected users speak and generate far-end audio data, the audio stream pushed by the local terminal device does not exhibit dropped words, stuttering, echo leakage or the like when played.
According to an alternative embodiment of the present exemplary embodiment, an echo cancellation function switch may be provided for the anchor user in the application program of the terminal device, letting the user decide whether to enable the "double-talk" optimization. When the technical solution of the present application starts, the state of the echo cancellation function switch of the user logged into the live broadcasting room is detected; the subsequent steps of the present application are executed when the switch is active, and otherwise are not executed.
As will be appreciated from the principles disclosed in this exemplary embodiment, the technical solution of the present application achieves positive effects far superior to the prior art, including but not limited to the following:

First, in the network live connection state, echo cancellation is performed on the real-time audio data collected by the local device with the local audio data and the far-end audio data decoupled in the echo cancellation link: only the far-end audio data is selected as the reference signal, so that the echo signal in the real-time audio data generated by the local play-out of the far-end audio data is cancelled, while the echo signal generated by the play-out of the local audio data is retained, yielding the intermediate audio data. The echo signal of the far-end audio data therefore does not interfere with the local audio data in the subsequent mixing stage and cause double-talk, and the audio stream pushed to the live broadcasting room, mixed from the intermediate audio data and the local audio data, does not suffer dropped words, stuttering or echo leakage from the local sound sources.

Second, the present application uses the loop-back delay value corresponding to the local device to delay-compensate the local audio data, so that the local audio data is mixed with the intermediate audio data at the lag specified by the loop-back delay value to obtain the mixed audio data.

In addition, the present application can be applied to application scenarios with an online conversation property, including but not limited to live network video, karaoke and the like. In such scenarios, based on the audio stream obtained after the processing of the present application, the listener hears all sound sources playing synchronously and smoothly; from the listener's perspective, the singer, the lyrics in the video and the background music are perceived as remaining in sync, improving the user experience.
Referring to fig. 5, in a deepened embodiment, the step S1100 of obtaining the far-end audio data in the live broadcasting room connection state, mixing it with the local audio data into play-out audio data, and playing the play-out audio data, comprises the following steps:

Step S1110, obtaining a far-end live stream pushed by a server in the live broadcasting room connection state:

In the network video live broadcast application scenario, the anchor user and other users start a connection and enter the connection state; with the support of the server that runs the live broadcasting room, each side can obtain the other side's live stream and transmit its own live stream to the other side.

Step S1120, extracting the far-end audio data from the far-end live stream:

The local terminal device needs to parse and output the live stream, so the far-end live stream can be parsed in advance to obtain the video stream and the audio stream within it, each of which is output accordingly.

The audio stream encapsulates the far-end audio data generated by the far-end user; this far-end audio data may itself be generated by applying the technical solution of the present application on the far-end user's terminal device.

Step S1130, mixing the far-end audio data with the local audio data to obtain the play-out audio data:

As can be further understood from the workflow of the playing thread in fig. 1, the far-end audio data is mixed with the local audio data corresponding to the background music being played by the local terminal device, obtaining the corresponding play-out audio data.

Step S1140, converting the play-out audio data into a speech signal and playing it:

Finally, according to conventional audio playback technology, the play-out audio data is converted into a speech signal, which is then played through a loudspeaker.

Through its disclosure of the network video live broadcast scenario, this embodiment further shows the beneficial effects obtained by the present application. It can be understood that in a network video live broadcast scenario the probability of double-talk among the communicating parties is relatively high; after the technical solution of the present application is applied, the adverse factors caused by double-talk are eliminated, further improving the speech quality of each party's communication, effectively guaranteeing the call quality of multi-party calls and improving the user experience.
Referring to fig. 6, in a deepened embodiment, the step S1200 of performing echo cancellation on the real-time audio data collected by the local device, with the far-end audio data as the reference signal, to obtain intermediate audio data, comprises the following steps:

Step S1210, continuously collecting the speech signal input in real time from the local sound card to obtain the real-time audio data:

The collection of the speech signal from an externally input sound source is generally realized through the sound card of the local terminal device: the sound card continuously collects the speech signal, performs analog-to-digital conversion on it to form corresponding speech frames, and assembles the corresponding audio data, as is known to those skilled in the art.

Step S1220, applying a preset adaptive echo filtering algorithm, performing echo cancellation on the real-time audio data with the far-end audio data as the reference signal, so as to cancel the echo signal corresponding to the far-end audio data:

Referring to the description of the exemplary embodiments of the present application, echo cancellation of the real-time audio data is implemented by applying a preset adaptive echo filtering algorithm, for example an AEC algorithm or one implemented by a neural network model, in which the far-end audio data of the far-end live stream serves as the reference signal, so that only the echo signal generated after the far-end audio data is played out locally is cancelled. When multiple far-end live streams exist, the far-end audio signals of all of them can be treated as a single far-end audio signal for echo processing.
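The aggregation of several far-end streams into one reference can be sketched as follows; this is a minimal illustration assuming 16-bit PCM frames of equal length, with the hypothetical helper name `merge_far_refs`:

```python
import numpy as np

def merge_far_refs(frames):
    """Sum several far-end audio frames into one aggregated reference frame,
    so multiple connected users are cancelled as a single far-end source."""
    acc = np.zeros_like(frames[0], dtype=np.int32)
    for f in frames:
        acc += f.astype(np.int32)   # widen to avoid int16 overflow
    return np.clip(acc, -32768, 32767).astype(np.int16)

# Two hypothetical connected users; the third sample saturates.
a = np.array([100, 200, 32000], dtype=np.int16)
b = np.array([-50, 400, 3000], dtype=np.int16)
print(merge_far_refs([a, b]).tolist())  # [50, 600, 32767]
```

The merged frame would then be fed to the echo canceller as its single reference signal.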
A unified algorithm is adopted in the programming implementation. Because an adaptive mechanism is applied, when no far-end audio data exists the reference signal of the algorithm is 0, so the question of cancelling the echo of far-end audio data simply does not arise, which further reflects the advantage of the adaptive approach.

Step S1230, retaining the echo signal corresponding to the local audio data in the real-time audio data, as the intermediate audio data:

After the echo cancellation processing, the echo signal corresponding to the local audio data in the real-time audio data still remains in the intermediate audio data; the hidden danger of double-talk is thus eliminated, while the echo signal can still be used to protect the sound quality of the local audio data.

By elaborating the echo cancellation process, this embodiment reveals the flexibility of the adaptive filtering algorithm applied by the present application: whether or not the current terminal device is in the connection state, under the adaptive mechanism the audio stream pushed by the local terminal device is guaranteed better sound quality.
Referring to fig. 7, in a specific embodiment, the step S1300 of mixing the local audio data, after superimposing the local loop-back delay value, with the intermediate audio data to obtain mixed audio data, comprises the following steps:

Step S1310, obtaining a loop-back delay value corresponding to the local device:

The loop-back delay value of a terminal device is usually determined by its hardware, the loop-back delay caused by the hardware accounting for a large proportion; consequently, the loop-back delay value of a terminal device can be calculated in advance and called directly in this step.

Step S1320, controlling the local audio data to lag by the loop-back delay value and mixing it with the intermediate audio data, so as to obtain mixed audio data:

Referring to fig. 1, the local audio data is lagged in the time domain according to the loop-back delay value and then mixed with the intermediate audio data obtained after the echo cancellation processing, thereby obtaining the mixed audio data.
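A minimal sketch of this lag-and-mix step, assuming 16-bit PCM, a loop-back delay already expressed in samples, and the hypothetical function name `mix_with_loopback_delay`:

```python
import numpy as np

def mix_with_loopback_delay(local, intermediate, delay_samples):
    """Lag the local audio by the loop-back delay (in samples), then mix it
    with the echo-cancelled intermediate capture so the local signal lines
    up in the time domain with its retained echo."""
    lagged = np.concatenate([np.zeros(delay_samples, dtype=np.int32),
                             local.astype(np.int32)])
    n = len(intermediate)
    if len(lagged) < n:
        lagged = np.pad(lagged, (0, n - len(lagged)))
    mixed = lagged[:n] + intermediate.astype(np.int32)
    return np.clip(mixed, -32768, 32767).astype(np.int16)

# Hypothetical one-sample delay: local audio shifts right by one sample
# before being summed with the captured intermediate data.
local = np.array([100, 200, 300], dtype=np.int16)
captured = np.array([10, 20, 30, 40], dtype=np.int16)
print(mix_with_loopback_delay(local, captured, 1).tolist())  # [10, 120, 230, 340]
```

In a streaming implementation the zero-prefix would be replaced by buffering local frames for `delay_samples` before mixing, but the alignment principle is the same.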
This embodiment further reveals the superposition process of the loop-back delay value, detailing the corresponding implementation and providing guidance to those skilled in the art.
Referring to fig. 8, in a further embodiment, the step S1310 of obtaining a loop-back delay value corresponding to the local device comprises the following steps:

Step S2100, presetting a loop-back identification signal into the play-out audio data at a first moment, the loop-back identification signal being a high-frequency signal outside the hearing band of the human ear and comprising a plurality of single-frequency signals, each single-frequency signal being set at equal intervals in the frequency domain:

To detect the delay value generated when the local audio data loops back after being played out by the terminal device, i.e. the loop-back delay value, a loop-back identification signal is preset into the local audio data in the service logic of the playing thread; the collection thread then detects, in the real-time audio data, how long it takes for the echo signal formed by playing the loop-back identification signal to appear. That duration can be determined as the loop-back delay value and used on the collection side to correct the audio data and achieve time alignment of the multiple sound sources.

Since the loop-back delay value depends mainly on the hardware performance of the terminal device, it is relatively fixed for a given terminal device; therefore, once obtained, the loop-back delay value can be stored locally on the terminal device and called directly thereafter.
The loop-back identification signal is custom-defined by the present application and constructed as a signal with a certain regularity and uniqueness, so that it is easy to distinguish from the signal content corresponding to the local audio data. In the present application, the loop-back identification signal is defined as a high-frequency signal outside the frequency band corresponding to the hearing range of the human ear. The frequencies the human ear can perceive lie between 20 Hz and 20000 Hz, and an out-of-band signal is imperceptible to the human ear, though this varies from person to person. The present application therefore selects a high-frequency signal outside the human hearing band as the loop-back identification signal, which fully accounts for the requirement of sound-quality control, does not destroy the speech content of any sound source, and does not disturb users.

In a preferred embodiment of the present application, the loop-back identification signal is configured to comprise a plurality of single-frequency signals, each set at equal intervals in the frequency domain; for example, two adjacent single-frequency signals may be separated by one, two or three sampling-resolution units. In this manner detection is relatively easy, and the corresponding detection algorithm runs efficiently and consumes little time.

In an alternative embodiment, the single-frequency signals of the loop-back identification signal may instead be arranged at geometric (equal-ratio) intervals, and the loop-back identification signal may take other forms of construction, so long as it can be detected in the service logic of the collection side by a corresponding algorithm.

A more specific, optional embodiment suitable for constructing the loop-back identification signal in the present application is given below.
first, the loop-back identification signal is constructed:
In this embodiment, the loop-back identification signal consists of a plurality of single-frequency signals at high frequency, for example three single-frequency signals, the frequency point of each lying outside the hearing band perceivable by the human ear. Let f0, f1, f2 be the frequencies corresponding to the three single-frequency signals of the loop-back identification signal, where:

f1 = f0 + 2Δf

f2 = f0 + 4Δf

Δf is the frequency resolution, i.e. Δf = 1/(N·Ts), where Ts is the current audio sampling period and N is the number of DFT (discrete Fourier transform) points; α0 is the amplitude of the loop-back identification signal, and taking 16-bit quantized audio as an example, a reference value of α0 is 8192. Each single-frequency signal typically lasts 200 ms.
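By way of illustration only, the following sketch constructs such a three-tone loop-back identification signal; the sampling rate, DFT size, and starting frequency are assumed example values chosen to be consistent with the constraints above (tones above 20 kHz, spacing 2Δf, amplitude 8192, duration 200 ms), not values specified by the present application:

```python
import numpy as np

FS = 48_000           # assumed sampling rate
N_DFT = 1024          # assumed DFT size
DF = FS / N_DFT       # frequency resolution Δf = 46.875 Hz
A0 = 8192             # reference amplitude for 16-bit audio

def make_loopback_id(f0=432 * DF, dur=0.2):
    """Three single-frequency tones spaced 2Δf apart, starting above the
    audible band (f0 = 20250 Hz with these assumptions)."""
    t = np.arange(int(FS * dur)) / FS
    freqs = (f0, f0 + 2 * DF, f0 + 4 * DF)
    sig = sum(np.sin(2 * np.pi * f * t) for f in freqs)
    return np.clip(A0 * sig, -32768, 32767).astype(np.int16)

sig = make_loopback_id()
print(len(sig))  # 9600 samples = 200 ms at 48 kHz
```

Placing f0 on an exact DFT bin (here 432·Δf) keeps the tones concentrated in single frequency bins, which is what makes the later bin-magnitude detection cheap.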
In a preferred embodiment, in view of the existence of the audio loop-back path, and in order to better detect the added single-frequency signals on the collection side, a mute signal ref_silence(t) of, for example, 2 seconds may be prepended before adding the loop-back identification signal on the playback side, so as to delay detection on the collection side. Accordingly, in the playing-side service logic, the loop-back identification signal may be expressed as the mute signal followed by the three superimposed single-frequency signals:

ref0(t) = ref_silence(t) ⊕ α0[sin(2πf0·t) + sin(2πf1·t) + sin(2πf2·t)], where ⊕ denotes concatenation in time.

Assume the audio data of the local playback source to be played by the terminal device is render(t). render(t) and the loop-back identification signal ref0(t) are mixed and played together, which completes the operation of adding the loop-back identification signal during real-time playback; the corresponding moment is recorded as the first moment.
Step S2200, detecting whether the loop-back identification signal exists in the real-time audio data, and determining a second moment when the loop-back identification signal is detected:

The loop-back identification signal constructed by the present application is distinctive and easy to detect, so the loop-back delay value can be determined efficiently and accurately; consequently, after the mixed audio data obtained by the collection thread is converted into a speech signal and played, the synchronization and coordination among the different sound sources is clearly audible and the sound quality is better.

In this embodiment, in order to detect the loop-back identification signal, the local audio data is suitably played out through a loudspeaker, so that after the collection thread collects the real-time audio data, the loop-back identification signal can be obtained from the echo signal of the local audio data within it. Of course, as described above, once the loop-back delay value has been determined and saved, it may be called directly thereafter without additional computation.
In an extended embodiment, in order to improve the success rate of detecting the loop-back identification signal, the operating environment of the present application may be checked in advance, specifically according to the following steps:

1) Detect whether the terminal device is simultaneously in the audio data collection state and the play-out state, where the play-out state means that the local terminal device has enabled a loudspeaker, such as a sound box, from which echo signals can conveniently be collected, rather than speaking through a device such as an earphone;

2) Detect whether the sampling performed by the terminal device uses a preset sampling rate, for example a sampling rate preset to 44.1 kHz or 48 kHz;

3) Detect whether the terminal device has an audio loop-back path corresponding to the local audio data; if so, the loop-back identification signal may be preset and detected.

Through this pre-detection, when the corresponding conditions are confirmed to be satisfied, namely that the terminal device is simultaneously in the collection state and the play-out state, samples at the preset sampling rate, and forms an audio loop-back path through play-out, the process of detecting the loop-back identification signal can be carried out.
After the loop-back identification signal is played out by the playing thread together with the local audio data, the collection thread starts detecting the loop-back identification signal in real time. It will be appreciated that the detection algorithms are designed to correspond to the features of the loop-back identification signal, although their specific implementation may vary, so as to accurately identify those features and establish the presence of the loop-back identification signal.

Once the loop-back identification signal has been detected by such an algorithm, the moment of detection can be determined and recorded as the second moment.
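One possible detection sketch, an assumption for illustration rather than the patented detector, flags a capture frame when all three marker frequency bins rise well above the frame's spectral floor; scanning successive frames, the index of the first flagged frame then gives the second moment. The sampling rate, frame size, marker frequencies, and threshold below are hypothetical:

```python
import numpy as np

FS = 48_000                               # assumed sampling rate
FRAME = 1024                              # assumed analysis frame (one DFT block)
MARKERS = (20250.0, 20343.75, 20437.5)    # hypothetical f0, f0+2Δf, f0+4Δf

def id_signal_present(frame, thresh_db=10.0):
    """True when every marker bin stands `thresh_db` above the frame's
    median spectral magnitude (a cheap noise-floor estimate)."""
    spec = np.abs(np.fft.rfft(frame.astype(float)))
    axis = np.fft.rfftfreq(len(frame), 1 / FS)
    floor = np.median(spec) + 1e-12
    for f in MARKERS:
        k = int(np.argmin(np.abs(axis - f)))
        if 20 * np.log10(spec[k] / floor) < thresh_db:
            return False
    return True

# A frame carrying the three tones is flagged; a noise-only frame is not.
t = np.arange(FRAME) / FS
tone_frame = sum(8192 * np.sin(2 * np.pi * f * t) for f in MARKERS)
noise_frame = np.random.default_rng(1).standard_normal(FRAME) * 100
print(id_signal_present(tone_frame), id_signal_present(noise_frame))  # True False
```

Requiring all three equally spaced bins simultaneously is what exploits the regularity of the constructed signal: an ordinary sound source is unlikely to excite exactly that comb of high-frequency bins at once.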
Step S2300, determining the loopback delay value according to the difference between the first time and the second time:
referring to fig. 9, the loop-back identification signal is added to the local audio data on the play-out side at time T1, i.e. the first moment; it is emitted by the anchor's speaker at time T2 and captured at time T3, i.e. the second moment at which it is detected on the acquisition side. The interval between the time T3 at which the real-time audio data is captured and the time T4 at which the acquisition thread mixes it is very small and can be ignored. Therefore, the difference between the second moment T3 and the first moment T1 is the actual time taken by the loop-back identification signal to complete the audio loop-back path, and this difference is determined as the loop-back delay value.
Step S2400, storing the loopback delay value for subsequent direct calling:
after the loop-back delay value is obtained, the loop-back delay value is stored in the local terminal equipment for subsequent calling.
In this embodiment, the construction of the loop-back identification signal and various alternative embodiments thereof are disclosed in an exemplary, deepened manner. The loop-back identification signal is defined by constructing a plurality of single-frequency signals equally spaced in the frequency domain, so that it presents more distinctive characteristics and is easier to identify. Because the loop-back identification signal lies outside the frequency band perceivable by the human ear, it adds no audible interference to the sound of the sound source. In addition, the loop-back identification signal fully considers the audio loop-back requirement: a mute signal is preset before it, so that the single-frequency signals are delayed relative to the mute signal, ensuring that the acquisition side has enough time to wait for the loop-back identification signal to appear. This further improves the success rate of detecting the loop-back identification signal, allows the loop-back delay value to be determined conveniently and efficiently, and guides the acquisition thread to mix the local audio data with its echo signal in alignment, protecting the sound quality of the local audio data.
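The construction just described — a mute segment followed by several equally spaced high-frequency tones — can be sketched as follows. The specific frequencies, amplitude, and durations here are illustrative assumptions (the patent's own numeric formulas are rendered as images and not reproduced in the text):

```python
import math

def build_loopback_signal(fs=48000, f0=19500.0, delta_f=250.0, n_tones=3,
                          mute_ms=50, tone_ms=200, amplitude=0.1):
    """A mute segment followed by a sum of equally spaced single-frequency
    tones near the top of the audible band, per the construction described
    above. All numeric choices are illustrative, not from the patent."""
    mute = [0.0] * (fs * mute_ms // 1000)
    n = fs * tone_ms // 1000
    tones = [
        amplitude * sum(math.sin(2 * math.pi * (f0 + i * delta_f) * t / fs)
                        for i in range(n_tones))
        for t in range(n)
    ]
    return mute + tones

sig = build_loopback_signal()
print(len(sig))  # 2400 mute samples + 9600 tone samples = 12000
```

The mute prefix gives the acquisition side a guaranteed window in which to arm its detector before the tones arrive.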
Referring to fig. 10, in a deepened embodiment, the step S2200 of detecting whether the loop-back identification signal exists in the real-time audio data includes the following steps:
step S2210, tracking a noise signal of the real-time audio data along a time domain, and transforming the noise signal to a frequency domain to obtain corresponding noise energy spectrum data:
as described above, since the loop-back identification signal is configured as a high-frequency signal and is easily interfered by the high-frequency signal, it is necessary to track the noise signal in the real-time audio data on the acquisition side in order to prevent the adverse effect of the high-frequency noise on the detection of the loop-back identification signal. The noise signal tracking can be realized by a person skilled in the art by means of various common algorithms, in the application, an MCRA series algorithm, especially an IMCRA algorithm, is recommended to be used for tracking the frequency point noise, and the IMCRA algorithm is an algorithm for tracking the minimum value of the frequency point in the time domain and is known to the person skilled in the art.
To allow the energy of the noise signal to be referenced, the noise signal in the real-time audio data must be transformed to the frequency domain to obtain corresponding noise energy spectrum data. A common approach for computing frequency-domain energy is the FFT (fast Fourier transform), which calculates energy values for all frequency points across the entire frequency domain; this embodiment may therefore apply the FFT to obtain the noise energy spectrum data corresponding to the noise signal.
In an embodiment optimized on this basis, considering that the loop-back identification signal is flexibly customized in the foregoing embodiments and contains only a plurality of single-frequency signals, only the energy values of the few frequency points added on the play-out side need to be calculated, for example the 3 frequency points disclosed above. In this case the FFT, with complexity O(N log N), is unnecessarily expensive. The optimized embodiment therefore recommends the Goertzel algorithm for calculating the frequency-domain energy, with complexity O(N), greatly reducing the computational load. The estimated frequency-point noise is expressed as λ(f_i), i = 0, 1, 2.
It will be appreciated from the disclosure herein that one skilled in the art can transform the noise signal from the time domain to the frequency domain in a variety of ways to obtain corresponding noise energy spectrum data.
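As a concrete illustration of the single-bin energy computation recommended above, the standard Goertzel recurrence can be written as follows (a generic textbook form, not the patent's own code):

```python
import math

def goertzel_energy(frame, freq, fs):
    """O(N) energy at one frequency point via the standard Goertzel
    recurrence; equivalent to |X(k)|^2 of the DFT at the nearest bin."""
    w = 2.0 * math.pi * freq / fs
    coeff = 2.0 * math.cos(w)
    s1 = s2 = 0.0
    for x in frame:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

fs, n = 48000, 480                       # 10 ms frame; bin width 100 Hz
frame = [math.sin(2 * math.pi * 19000 * t / fs) for t in range(n)]
on = goertzel_energy(frame, 19000, fs)   # energy at the tone frequency
off = goertzel_energy(frame, 17000, fs)  # energy at a distant bin
print(on > 1000 * off)                   # True: the tone's bin dominates
```

Only the 3 monitored frequency points need this per-frame computation, which is why the O(N) cost per bin beats a full FFT here.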
Step S2220, locating peak positions of the frequency points according to the voice energy spectrum data mapped by the voice frames of the real-time audio data:
accordingly, each speech frame in the real-time audio data is also transformed from the time domain to the frequency domain using a Fourier transform algorithm, yielding speech energy spectrum data from which its corresponding energy values are determined. On this basis, the loop-back identification signal can be detected from the noise energy spectrum data and the speech energy spectrum data.
Adapted to the aforementioned exemplary construction of the loop-back identification signal, when ref_s0(t) is added on the play-out side, the acquisition side starts detection. For the speech energy spectrum data mapped by each speech frame of the real-time audio data, the acquisition side first judges whether the current frequency point is a peak of the speech energy spectrum data, denoted P_peak(i).
where E(f_i) is the energy value of the current frequency point, calculated via the Goertzel algorithm. Because the single-frequency components of the added loop-back identification signal are spaced Δf apart, the energy at the currently detected frequency forms a peak relative to the adjacent frequency points above and below it.
Step S2230, calculating the existence probability of the loop recognition signal in each voice frame according to the voice energy and the noise energy corresponding to each frequency point:
calculating the existence probability P_f(i) of the loop-back identification signal from the energy of each frequency point and the background-noise energy, using the following formula:
where E_max(f_i) is an energy threshold indicating the presence of the loop-back identification signal in the current frame; that is, when the energy value at the frequency point exceeds this preset threshold, the loop-back identification signal is judged with high certainty to exist in the corresponding speech frame. The threshold is generally taken as E_max(f_i) = 1.8 * E(f_i) and may be flexibly adjusted according to the actual situation. The term log E(f_i) − log λ(f_i) can be understood as the signal-to-noise ratio of the current frequency point.
When a frame satisfies P_peak(i) = 1, it may contain the loop-back identification signal; combining the two features above, the existence probability of the loop-back identification signal in the current frame is obtained:
wherein W is a weight, claimW is taken in general f(i) =1/3,i =0, 1,2. In particular, if some devices acquire a weak signal at a high frequency, the weight at that frequency may be adjusted appropriately.
Step S2240, determining that the loop-back identification signal is detected when the existence probabilities of a plurality of continuous speech frames satisfy a preset condition:
in an exemplary embodiment, the loop-back identification signal added on the play-out side lasts 200 ms in the time domain, so a time-domain factor is introduced during detection on the acquisition side, and the inter-frame probability persistence IPP (Interframe Probability Persistence) is defined by the following formula:
where T is the number of frames over which the inter-frame probability persists; detection of the loop-back identification signal is generally declared when T = 3 with IPP = 1, or T = 6 with IPP = 0.8. The time at which the loop-back identification signal is first detected is the second moment, the time at which ref_s0(t) was first added is the first moment, and the interval between the two is the audio loop-back delay.
In general, once the loop-back identification signal has been detected, the play-out side stops adding it; in the ideal case observed in actual measurement, the play-out side needs to add about 3 frames of the loop-back identification signal. With the exemplary algorithm described above, the measured detection error is within 20 ms.
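The persistence criterion can be sketched as a sliding-window check over the per-frame probabilities. The averaging interpretation of IPP below is an assumption (the IPP formula itself is an image in the source); the "T = 3, IPP = 1" case from the text is used as the default:

```python
def detect_with_ipp(frame_probs, t_frames=3, ipp_threshold=1.0):
    """Declare detection once t_frames consecutive frames hold an average
    existence probability at or above ipp_threshold. Returns the index of
    the frame where the criterion first fires, or None if it never does."""
    for end in range(t_frames, len(frame_probs) + 1):
        window = frame_probs[end - t_frames:end]
        if sum(window) / t_frames >= ipp_threshold:
            return end - 1     # frame index at which the criterion is met
    return None

probs = [0.1, 0.2, 1.0, 1.0, 1.0, 0.3]
print(detect_with_ipp(probs))  # 4: frames 2, 3, 4 all hold probability 1
```

The frame index returned here maps to the second moment; subtracting the first moment yields the loop-back delay value.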
This embodiment shows that once detection of the loop-back identification signal is started, it proceeds from the noise energy and the speech energy: the plurality of single-frequency signals is detected by means of probability estimation, and inter-frame persistence is taken into account by confirming the signal across several consecutive speech frames. The given algorithm is simple to implement and computationally efficient, and the detection error can be controlled within 20 ms, which is very accurate.
In a further extended embodiment, when detection of the loop-back identification signal has been attempted a preset number of times and has failed every time, the following step may be performed: in response to the event of failing to detect the loop-back identification signal, reconstructing the loop-back identification signal and initiating a further detection, the reconstructed loop-back identification signal being a high-frequency signal outside the hearing band of the human ear whose frequency is lower than that of the previously constructed loop-back identification signal.
The frequency of the reconstructed loop-back identification signal is kept below the frequency points of the previous one mainly for hardware reasons: current terminal devices capture high-frequency signals only weakly. For this reason, the frequencies in the loop-back identification signal, and even its amplitude, need to be adjusted appropriately so that the service logic on the acquisition side can detect it.
As an example, the reconstructed loop-back identification signal may be expressed as follows:
wherein:
generally, α 1 >α 0 ,f 1i <f i . In practice, alpha is generally taken 1 =10922,f 1i The frequency of (2) is around 19 KHz.
After the new loop-back identification signal is reconstructed, the detection procedure is re-executed according to the process disclosed in the related embodiments; by changing the loop-back identification signal in this way, it can be successfully detected and the loop-back delay value successfully determined.
In order to more fully reveal the advantages of the technical scheme of the application, the relevant description of the beneficial effects of the technical scheme of the application is continuously given:
1. experimental data obtained by using the loop-back delay value calculation scheme provided by the application:
in an actual application scene, the audio loop-back delay is calculated in real time without any additional operation by the user, guaranteeing the alignment of voice and accompaniment in remote karaoke and live-singing scenes and improving the user experience of network live broadcast. After the audio loop-back delay value is calculated in real time by the technical scheme, voice and accompaniment can be aligned more strictly, surpassing the experience of other similar products. The following table compares 5 audio loop-back delay values calculated by implementing the technical scheme of the present application on 4 mobile terminal devices against manual measurement results:
In the table, android models 0p×r×n×3 and v× xpI ×6 are collected and played under media tones by using the JAVA API. As can be seen from the table, the detection error of the present application is controlled within 20 ms.
2. Description of the advantages achieved by applying the echo cancellation technique of the present application:
in the related application scene provided by the present application, the signal that the anchor user ultimately needs to push is the anchor's voice plus the locally played music after delay compensation. The processing of the anchor's voice and of the music is therefore decoupled and handled separately, so that both sound qualities are optimal. In the present application this is embodied as follows:
1) Protecting the anchor sound:
if the anchor's voice forms double-talk with the playing music, its sound quality degrades. To avoid frequent double-talk scenes, the reference signal for echo cancellation in the present application does not contain the locally played music, i.e. the local audio data; double-talk then occurs only when a connected user and the anchor speak at the same time, ensuring convergence of the adaptive filter. The optimized processing flow of the acquisition thread is shown in fig. 11.
Referring to fig. 11, consider the scenarios separately. When an anchor pushes a live stream while background music plays on the local terminal device and no user is connected, the played signal is only the music signal; for this scenario the echo-cancellation reference signal is 0, which is equivalent to echo cancellation not operating, and the anchor's voice suffers no loss. When an anchor pushes a live stream, plays background music, and is connected with other users, the played signal contains the music signal and the connected users' voices, i.e. the far-end audio data.
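The reference-signal selection described above reduces to a small rule: only far-end audio ever enters the reference, never the local music. A minimal sketch, with illustrative names not from the patent:

```python
def echo_cancel_reference(far_end_frame, frame_len):
    """Per the decoupling above, the echo-cancellation reference contains
    only the far-end (connected users') audio, never the locally played
    music. With no connected user the reference is all zeros, so echo
    cancellation is effectively inactive and the anchor's voice passes
    through untouched."""
    if far_end_frame is None:                # no connected user
        return [0.0] * frame_len
    return list(far_end_frame)

print(echo_cancel_reference(None, 4))        # [0.0, 0.0, 0.0, 0.0]
print(echo_cancel_reference([0.1, -0.2], 2)) # [0.1, -0.2]
```

Keeping the music out of the reference is what prevents the adaptive filter from treating anchor-plus-music as constant double-talk.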
2) Protecting music signals:
after this scheme is adopted, the voice signal in the intermediate audio data after echo cancellation contains the anchor's voice and the music echo. The music echo is distorted relative to the original signal due to spatial reflection; mixing it with the played music signal (the local audio data) compensates for this distortion. The processing flow is shown in fig. 12.
With reference to fig. 12, because the terminal device both plays the music and captures it, if the delay-compensation error exceeds 50 ms the audience will hear 2 paths of music after mixing, seriously degrading the listening experience. In the non-connected state, since the mix is of the anchor's voice and the music, the delay-compensation error requirement is insensitive. In the connected state, the delay compensation requires an error of less than 50 ms. Referring to fig. 13, after the delay-compensation scheme of the present application is applied, the difference between the compensated music signal and the original music signal is small, so that the sound quality of the pushed music can be ensured.
3. Description of experimental data in comparison with the prior art:
when the anchor terminal publishes a live broadcast and plays music without being connected with other users, the contrast before and after the anchor-voice processing is shown in fig. 14. In fig. 14, from top to bottom, are the anchor signal, the music signal, the echo-cancelled signal before the improvement of the technical scheme of the present application, and the echo-cancelled signal after the improvement. By comparison, the anchor signal after echo cancellation before the improvement exhibits obvious dropped words, while the improved voice signal is consistent with the original anchor signal, ensuring no voice loss during processing.
When the anchor terminal publishes a live broadcast, plays music and is connected with other users, the contrast before and after the anchor-voice processing is shown in fig. 15. In fig. 15, from top to bottom, are the anchor signal, the audience signal, the music signal, the echo-cancelled signal before the improvement of the technical scheme of the present application, and the echo-cancelled signal after the improvement. The echo-cancellation reference signal before the improvement is the superposition of the music signal and the echo of the audience signal; the figure shows that the echo-cancelled signal before the improvement drops words at 6 s, losing audio, and exhibits obvious leaked echo at 8 s, while the quality of the echo-cancelled signal after the improvement is clearly better than before.
Referring to fig. 16, the network live audio processing device provided by the present application, adapted to deploy the functions of the network live audio processing method of the present application, comprises a pull-stream playing module 1100, an echo cancellation module 1200, a loop-back correction module 1300 and a live broadcast pushing module 1400. The pull-stream playing module 1100 is used to acquire far-end audio data in the live-room connection state, mix it with local audio data into play-out audio data, and play it; the echo cancellation module 1200 is configured to perform echo cancellation on the real-time audio data acquired by the local machine, using the far-end audio data as the reference signal, to obtain intermediate audio data in which the echo signal of the far-end audio data is cancelled and the echo signal of the local audio data is retained; the loop-back correction module 1300 is configured to delay the local audio data by a local loop-back delay value and mix it with the intermediate audio data to obtain mixed audio data; and the live broadcast pushing module 1400 is configured to push a live stream containing the mixed audio data to the live broadcast room.
In a further embodiment, the streaming play module 1100 includes: the link stream pulling sub-module is used for acquiring a remote live stream pushed by a server in a live broadcast room link state; an audio extraction sub-module, configured to extract far-end audio data from the far-end live stream; the multi-source audio mixing sub-module is used for mixing the far-end audio data with the local audio data to obtain play audio data; and the play output sub-module is used for converting the voice signal according to the play audio data to play.
In a further embodiment, the echo cancellation module 1200 includes: the real-time acquisition sub-module is used for continuously acquiring real-time input voice signals from the local sound card to obtain real-time audio data; the echo filtering sub-module is used for applying a preset self-adaptive echo filtering algorithm, taking the far-end audio data as a reference signal, and executing echo cancellation processing on the real-time audio data so as to cancel echo signals corresponding to the far-end audio data; and the intermediate acquisition sub-module is used for reserving echo signals corresponding to the local audio data in the real-time audio data as the intermediate audio data.
In an embodiment, the loopback correction module 1300 includes: the delay calculation sub-module is used for obtaining a loopback delay value corresponding to the local equipment; and the time delay compensation sub-module is used for controlling the local audio data to mix with the intermediate audio data according to the loop-back delay value hysteresis so as to obtain mixed audio data.
In a further embodiment, the delay calculation submodule includes: a loop-back presetting unit, used to preset a loop-back identification signal into the play-out audio data at a first moment, the loop-back identification signal being a high-frequency signal outside the hearing band of the human ear and comprising a plurality of single-frequency signals arranged at equal intervals in the frequency domain; a loop-back detection unit, used to detect whether the loop-back identification signal exists in the real-time audio data and to determine a second moment when it is detected; a loop-back calculation unit, configured to determine the loop-back delay value from the difference between the first moment and the second moment; and a loop-back storage unit, used to store the loop-back delay value for subsequent direct calling.
In a further embodiment, the loopback detection unit comprises: the noise tracking subunit is used for tracking the noise signal of the real-time audio data along the time domain, and transforming the noise signal to the frequency domain to obtain corresponding noise energy spectrum data; the peak value positioning subunit is used for positioning the peak value position of each frequency point according to the voice energy spectrum data mapped by the voice frame of the real-time audio data; the probability estimation subunit is used for calculating the existence probability of the loopback recognition signal in each voice frame according to the voice energy and the noise energy corresponding to each frequency point; and the signal detection subunit is used for judging that the loop-back identification signal is detected when the existence probability of a plurality of continuous voice frames meets the preset condition.
In an extended embodiment, the apparatus further includes, operating before the echo cancellation module 1200: a state detection module, used to detect the echo-cancellation function switch of the live-room user logged in on the local machine, driving the echo cancellation module 1200 to work when the switch is in the activated state and otherwise halting the operation of the echo cancellation module 1200.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. As shown in fig. 17, the internal structure of the computer device is schematically shown. The computer device includes a processor, a computer readable storage medium, a memory, and a network interface connected by a system bus. The computer readable storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store a control information sequence, and when the computer readable instructions are executed by a processor, the processor can realize a network live broadcast audio processing method. The processor of the computer device is used to provide computing and control capabilities, supporting the operation of the entire computer device. The memory of the computer device may store computer readable instructions that, when executed by the processor, cause the processor to perform the network live audio processing method of the present application. The network interface of the computer device is for communicating with a terminal connection. It will be appreciated by those skilled in the art that the structure shown in FIG. 17 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
The processor in this embodiment is configured to execute specific functions of each module and its sub-module in fig. 16, and the memory stores program codes and various data required for executing the above-mentioned modules or sub-modules. The network interface is used for data transmission between the user terminal or the server. The memory in this embodiment stores program codes and data required for executing all modules/sub-modules in the network live audio processing apparatus of the present application, and the server can call the program codes and data of the server to execute the functions of all sub-modules.
The present application also provides a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method for live audio processing of any of the embodiments of the present application.
The application also provides a computer program product comprising computer programs/instructions which when executed by one or more processors implement the steps of a method for live audio processing according to any of the embodiments of the application.
Those skilled in the art will appreciate that all or part of the processes implementing the methods of the above embodiments of the present application may be implemented by a computer program for instructing relevant hardware, where the computer program may be stored on a computer readable storage medium, where the program, when executed, may include processes implementing the embodiments of the methods described above. The storage medium may be a computer readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
In summary, the present application decouples the played music from the connected users' sound in the network-connected state: the echo-cancellation reference signal selects only the connected users' sound, avoiding frequent double-talk, and the music echo is compensated by the local-music delay before mixing, protecting the music quality; the voice playing effect in the network-connected state is thereby improved.
Those of skill in the art will appreciate that the various operations, methods, steps in the flow, acts, schemes, and alternatives discussed in the present application may be alternated, altered, combined, or eliminated. Further, other steps, means, or steps in a process having various operations, methods, or procedures discussed herein may be alternated, altered, rearranged, disassembled, combined, or eliminated. Further, steps, measures, schemes in the prior art with various operations, methods, flows disclosed in the present application may also be alternated, altered, rearranged, decomposed, combined, or deleted.
The foregoing is only a partial embodiment of the present application, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.

Claims (6)

1. The network live broadcast audio processing method is characterized by comprising the following steps of:
acquiring far-end audio data in a live broadcasting room connection state, mixing the far-end audio data with local audio data to obtain play-out audio data and playing the play-out audio data;
performing echo cancellation on the real-time audio data acquired by the local machine by taking the far-end audio data as a reference signal to obtain intermediate audio data, wherein echo signals of the far-end audio data are cancelled and echo signals of the local audio data are reserved;
mixing the local audio data with the intermediate audio data after the local loop-back delay value is superimposed to obtain mixed audio data, comprising the following steps: presetting a loop-back identification signal to the play-out audio data at a first moment, wherein the loop-back identification signal is a high-frequency signal outside the hearing frequency band of the human ear and comprises a plurality of single-frequency signals, and the single-frequency signals are arranged at equal intervals on the frequency domain; detecting whether the loop-back identification signal exists in the real-time audio data, and determining a second moment when the loop-back identification signal is detected; determining the loop-back delay value according to the difference between the first moment and the second moment; controlling the local audio data to lag by the loop-back delay value when mixing with the intermediate audio data so as to obtain mixed audio data;
Pushing a live stream containing the mixed audio data to the live broadcasting room;
the detecting whether the loop-back identification signal exists in the real-time audio data comprises the following steps: tracking a noise signal of the real-time audio data along a time domain, and transforming the noise signal to a frequency domain to obtain corresponding noise energy spectrum data; according to voice energy spectrum data mapped by voice frames of the real-time audio data, positioning peak positions of all frequency points; calculating the existence probability of the loopback recognition signal in each voice frame according to the voice energy and the noise energy corresponding to each frequency point; and judging that the loop-back identification signal is detected when the existence probability of a plurality of continuous voice frames meets the preset condition.
2. The network live audio processing method according to claim 1, wherein the step of obtaining the remote audio data in the live room connection state, mixing the remote audio data with the local audio data, and playing the mixed audio data, comprises the steps of:
acquiring a remote live stream pushed by a server in a live broadcasting room connection state;
extracting far-end audio data from the far-end live stream;
mixing the far-end audio data with the local audio data to obtain externally-placed audio data;
And converting the voice signal according to the external audio data to play.
3. The method for processing live audio according to claim 1, wherein echo cancellation is performed on the real-time audio data collected by the local machine by taking the far-end audio data as a reference signal to obtain intermediate audio data, comprising the steps of:
continuously acquiring real-time input voice signals from a local sound card to obtain real-time audio data;
applying a preset adaptive echo filtering algorithm, taking the far-end audio data as a reference signal, and executing echo cancellation processing on the real-time audio data to cancel echo signals corresponding to the far-end audio data;
and retaining echo signals corresponding to the local audio data in the real-time audio data as the intermediate audio data.
4. A method for processing live audio according to any one of claims 1 to 3, wherein the step of performing echo cancellation on the locally acquired real-time audio data using the far-end audio data as a reference signal to cancel an echo signal corresponding to the far-end audio data, and obtaining intermediate audio data, comprises the steps of:
And detecting an echo cancellation function switch of the user in the live broadcasting room logged in by the local machine, and starting a subsequent step when the echo cancellation function switch is in an activated state, otherwise, not executing the subsequent step.
5. A computer device comprising a central processor and a memory, characterized in that the central processor is configured to invoke a computer program stored in the memory so as to perform the steps of the method according to any one of claims 1 to 3.
6. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implementing the method according to any one of claims 1 to 3, and the computer program, when invoked by a computer, performs the steps of the corresponding method.
CN202111144000.5A 2021-09-28 2021-09-28 Network live broadcast audio processing method and device, equipment, medium and product thereof Active CN113938746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111144000.5A CN113938746B (en) 2021-09-28 2021-09-28 Network live broadcast audio processing method and device, equipment, medium and product thereof

Publications (2)

Publication Number Publication Date
CN113938746A CN113938746A (en) 2022-01-14
CN113938746B true CN113938746B (en) 2023-10-27

Family

ID=79277184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111144000.5A Active CN113938746B (en) 2021-09-28 2021-09-28 Network live broadcast audio processing method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN113938746B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115579015B (en) * 2022-09-23 2023-04-07 恩平市宝讯智能科技有限公司 Big data audio data acquisition management system and method
CN116168712A (en) * 2023-02-23 2023-05-26 广州趣研网络科技有限公司 Audio delay cancellation method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102811310A (en) * 2011-12-08 2012-12-05 苏州科达科技有限公司 Method and system for controlling voice echo cancellation on network video camera
CN110138650A (en) * 2019-05-14 2019-08-16 北京达佳互联信息技术有限公司 Sound quality optimization method, device and the equipment of instant messaging
CN110335618A (en) * 2019-06-06 2019-10-15 福建星网智慧软件有限公司 A kind of method and computer equipment improving non-linear inhibition
CN110970045A (en) * 2019-11-15 2020-04-07 北京达佳互联信息技术有限公司 Mixing processing method, mixing processing device, electronic equipment and storage medium
CN111372121A (en) * 2020-03-16 2020-07-03 北京文香信息技术有限公司 Echo cancellation method, device, storage medium and processor
CN111402910A (en) * 2018-12-17 2020-07-10 华为技术有限公司 Method and equipment for eliminating echo
CN113192526A (en) * 2021-04-28 2021-07-30 北京达佳互联信息技术有限公司 Audio processing method and audio processing device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9008302B2 (en) * 2010-10-08 2015-04-14 Optical Fusion, Inc. Audio acoustic echo cancellation for video conferencing
US20130294611A1 (en) * 2012-05-04 2013-11-07 Sony Computer Entertainment Inc. Source separation by independent component analysis in conjuction with optimization of acoustic echo cancellation

Similar Documents

Publication Publication Date Title
CN113938746B (en) Network live broadcast audio processing method and device, equipment, medium and product thereof
US8126161B2 (en) Acoustic echo canceller system
CN110970045B (en) Mixing processing method, mixing processing device, electronic equipment and storage medium
US9591422B2 (en) Method and apparatus for audio interference estimation
US7881460B2 (en) Configuration of echo cancellation
US9508358B2 (en) Noise reduction system with remote noise detector
EP1913708B1 (en) Determination of audio device quality
CN108141502A (en) Audio signal processing
US11380312B1 (en) Residual echo suppression for keyword detection
CN108076239B (en) Method for improving IP telephone echo
CN103827966A (en) Processing audio signals
US11785406B2 (en) Inter-channel level difference based acoustic tap detection
JP2007515911A (en) System and method for enhancing stereo sound
CN110956976B (en) Echo cancellation method, device and equipment and readable storage medium
US8194851B2 (en) Voice processing apparatus, voice processing system, and voice processing program
US10090882B2 (en) Apparatus suppressing acoustic echo signals from a near-end input signal by estimated-echo signals and a method therefor
JP5034607B2 (en) Acoustic echo canceller system
JP2022542962A (en) Acoustic Echo Cancellation Control for Distributed Audio Devices
US10937418B1 (en) Echo cancellation by acoustic playback estimation
CN110012331A (en) A kind of far field diamylose far field audio recognition method of infrared triggering
CN113891152A (en) Audio playing control method and device, equipment, medium and product thereof
US8582754B2 (en) Method and system for echo cancellation in presence of streamed audio
US8462193B1 (en) Method and system for processing audio signals
WO2021120795A1 (en) Sampling rate processing method, apparatus and system, and storage medium and computer device
US11386911B1 (en) Dereverberation and noise reduction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant