CN111212032B - Audio processing method and device based on video network, electronic equipment and storage medium



Publication number
CN111212032B
Authority
CN
China
Prior art keywords
audio data
audio
microphone
mobile terminal
video
Prior art date
Legal status
Active
Application number
CN201911285285.7A
Other languages
Chinese (zh)
Other versions
CN111212032A (en)
Inventor
蔡耀
曾绳涛
韩杰
杨春晖
Current Assignee
Visionvera Information Technology Co Ltd
Original Assignee
Visionvera Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Visionvera Information Technology Co Ltd
Priority to CN201911285285.7A
Publication of CN111212032A
Application granted
Publication of CN111212032B
Legal status: Active


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/75Media network packet handling
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/1066Session management
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40Support for services or applications
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L2021/02082Noise filtering the noise being echo, reverberation of the speech
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal

Abstract

The embodiments of the present application provide a video-network-based audio processing method and apparatus, an electronic device, and a storage medium. A video networking terminal is deployed in the video network and is communicatively connected to a streaming media server, which is communicatively connected to a mobile terminal on which an audio playing component is configured. The method is applied to an application object installed on the mobile terminal and includes the following steps: when the start of a preset audio/video call service is detected, triggering a preset audio acquisition mode on the mobile terminal; receiving first audio data sent by the streaming media server and calling the audio playing component to play it; obtaining second audio data collected by the mobile terminal in the audio acquisition mode; and performing echo cancellation on the second audio data according to the first audio data and the second audio data to obtain echo-cancelled target audio data. The method and apparatus can improve the audio call quality of video networking conferences.

Description

Audio processing method and device based on video network, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular to a video-network-based audio processing method and apparatus, an electronic device, and a storage medium.
Background
Currently, with the nationwide popularization and development of video networking services, high-definition video networking interaction technology is playing a significant role in government departments and other industries. The video network adopts Vision Vera real-time high-definition video switching technology to achieve real-time, network-wide transmission of high-definition video, which the current Internet cannot provide. It integrates dozens of services, such as high-definition video conferencing, video surveillance, remote training, intelligent monitoring and analysis, emergency command, video telephony, live broadcast, television mail, and information distribution, into one system platform, realizing real-time interconnection and high-definition video communication across a variety of terminal devices.
With the widespread use of video conferencing, some conferences now span both the video network and 4G networks. For example, when an unmanned aerial vehicle (UAV) joins a conference, the UAV is usually connected to a mobile phone; the UAV is controlled through a video networking application installed on the phone, which at the same time carries an audio/video call with the video networking terminal in the command hall. In such conferences the UAV party is usually in the field, so the user typically plugs a microphone into the phone, and while the microphone captures the user's voice, the phone plays the other party's voice. After the phone plays the other party's voice, however, that sound is reflected as an echo, which is picked up along with the newly captured speech and transmitted back to the video networking terminal. As a result, when the terminal in the command hall plays the returned audio, the other party hears an echo of what they said in the previous exchange.
In the prior art, video networking video conferencing generally suppresses such echo by setting a time interval so that the human ear cannot distinguish the echo from the newly captured sound. This approach cannot eliminate the echo completely and places strict requirements on the choice of interval; moreover, because the loop runs continuously, echoes accumulate over time and eventually produce a buzzing sound that degrades call quality.
Disclosure of Invention
In view of the above, embodiments of the present application are proposed to provide a video-network-based audio processing method, apparatus, electronic device, and storage medium that overcome, or at least partially solve, the above problems.
In a first aspect, an embodiment of the present application provides a video-network-based audio processing method. A video networking terminal is deployed in the video network and is communicatively connected to a streaming media server, which is communicatively connected to a mobile terminal on which an audio playing component is configured. The method is applied to an application object installed on the mobile terminal and includes:
when detecting that a preset audio and video call service is started, triggering a preset audio acquisition mode on the mobile terminal;
receiving first audio data sent by the streaming media server in the audio and video call service, and calling the audio playing component to play the first audio data; the first audio data is sent to the streaming media server by the video networking terminal;
acquiring second audio data acquired by the mobile terminal in the audio acquisition mode;
according to the first audio data and the second audio data, performing echo cancellation processing on the second audio data to obtain target audio data subjected to echo cancellation processing;
and sending the target audio data to the streaming media server, wherein the streaming media server is used for sending the target audio data to the video networking terminal.
Optionally, a first microphone and a second microphone are configured on the mobile terminal, and triggering the audio acquisition mode preset on the mobile terminal includes:
invoking the first microphone and the second microphone;
acquiring second audio data acquired by the mobile terminal in the audio acquisition mode, including:
obtaining first microphone audio data captured by the first microphone and second microphone audio data captured by the second microphone;
and according to the first microphone audio data and the second microphone audio data, carrying out noise reduction processing on the second microphone audio data to obtain second audio data.
Optionally, performing echo cancellation processing on the second audio data according to the first audio data and the second audio data to obtain target audio data after echo cancellation processing, including:
determining third audio data corresponding to the first audio data in the second audio data;
and filtering the third audio data from the second audio data to obtain target audio data with the third audio data filtered.
Optionally, while triggering the preset audio acquisition mode on the mobile terminal, the method further includes:
calling a self-adaptive filter arranged in the mobile terminal;
determining, in the second audio data, third audio data having the same frequency as the first audio data, including:
inputting the first audio data into the adaptive filter to obtain output audio data output by the adaptive filter;
in the second audio data, third audio data having the same frequency as the output audio data is determined.
Optionally, invoking an adaptive filter configured on the mobile terminal includes:
determining at least one application program interface which is matched with the application program object on the video network terminal, and determining whether a target interface exists in the at least one application program interface;
when the target interface exists in the at least one application program interface, calling an adaptive filter corresponding to the target interface through the target interface;
when the target interface does not exist in the at least one application program interface, calling an adaptive filter corresponding to a preset application program interface through the preset application program interface.
In a second aspect, an embodiment of the present application provides a video-network-based audio processing apparatus. A video networking terminal is deployed in the video network and is communicatively connected to a streaming media server, which is communicatively connected to a mobile terminal on which an audio playing component is configured. The apparatus is applied to an application object installed on the mobile terminal; it may specifically be a virtual apparatus and may include the following modules:
the audio mode triggering module is used for triggering a preset audio data acquisition mode on the mobile terminal when detecting that a preset audio call service is started;
the audio data receiving and playing module is used for receiving first audio data sent by the streaming media server, calling the audio playing component and playing the first audio data; the first audio data is sent to the streaming media server by the video networking terminal;
the audio data acquisition module is used for acquiring second audio data acquired by the mobile terminal in the audio data acquisition mode;
the audio data processing module is used for performing echo cancellation processing on the second audio data according to the first audio data and the second audio data to obtain target audio data after echo cancellation processing;
and the audio data sending module is used for sending the target audio data to the streaming media server, and the streaming media server is used for sending the target audio data to the video networking terminal.
Optionally, a first microphone and a second microphone are configured on the mobile terminal, and the audio mode triggering module may be specifically configured to invoke the first microphone and the second microphone;
the audio data acquisition module may specifically include the following units:
a microphone audio data acquisition unit configured to acquire first microphone audio data acquired by the first microphone and second microphone audio data acquired by the second microphone;
and the noise reduction processing unit is used for carrying out noise reduction processing on the second microphone audio data according to the first microphone audio data and the second microphone audio data to obtain second audio data.
Optionally, the audio data processing module may specifically include the following units:
the audio data searching unit is used for determining third audio data corresponding to the first audio data in the second audio data;
and the audio data filtering unit is used for filtering the third audio data from the second audio data to obtain target audio data with the third audio data filtered.
Optionally, the apparatus may further specifically include the following modules:
the calling module is used for calling the self-adaptive filter arranged in the mobile terminal;
the audio data searching unit may specifically include the following units:
an audio data input unit, configured to input the first audio data into the adaptive filter, so as to obtain output audio data output by the adaptive filter;
an audio data determination unit configured to determine, among the second audio data, third audio data having the same frequency as the output audio data.
Optionally, the calling module may specifically include the following units:
the target interface determining unit is used for determining at least one application program interface which is matched with the application program object on the video network terminal and determining whether a target interface exists in the at least one application program interface;
a first calling unit, configured to call, through the target interface, an adaptive filter corresponding to the target interface when the target interface exists in the at least one application program interface;
and the second calling unit is used for calling the adaptive filter corresponding to the preset application program interface through the preset application program interface when the target interface does not exist in the at least one application program interface.
In a third aspect, an embodiment of the present application further discloses an electronic device, including:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the electronic device to perform the video-network-based audio processing method described in the embodiments of the application.
In a fourth aspect, this application further discloses a computer-readable storage medium storing a computer program that causes a processor to execute the video-network-based audio processing method according to this application.
Compared with the prior art, the embodiment of the application has the following advantages:
In the embodiments of the application, when the application object installed on the mobile terminal detects that the preset audio/video call service has been started, it triggers the preset audio acquisition mode. When first audio data sent by the streaming media server is received, the audio playing component on the mobile terminal is called to play it; second audio data collected by the mobile terminal in the audio acquisition mode is then obtained, echo cancellation is performed on the second audio data according to the first audio data, and the resulting target audio data is sent to the video networking terminal via the streaming media server. Because the application object makes the mobile terminal collect the second audio data in the preset audio acquisition mode, the audio quality of the second audio data is improved; and because the application object performs echo cancellation on the collected second audio data according to the first audio data, the echo produced by playing the first audio data is removed from the second audio data, improving both the echo-processing effect and the call quality.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic view of a video networking of the present application;
FIG. 2 is a schematic diagram of a hardware architecture of a node server according to the present application;
fig. 3 is a schematic diagram of a hardware architecture of an access switch of the present application;
fig. 4 is a schematic diagram of a hardware structure of an ethernet protocol conversion gateway according to the present application;
fig. 5 is an application scene diagram of an audio processing method based on a video network according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating steps of a method for video-networking based audio processing according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an audio processing apparatus based on a video network according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
The video network is an important milestone in network development. It is a real-time network that enables real-time transmission of high-definition video, pushing numerous Internet applications toward high definition and enabling face-to-face HD communication.
The video network adopts real-time high-definition video switching technology and can integrate dozens of required services, such as video, voice, images, text, communication, and data, on one network platform: high-definition video conferencing, video surveillance, intelligent monitoring and analysis, emergency command, digital broadcast television, delayed television, network teaching, live broadcast, VOD on demand, television mail, personal video recorder (PVR), intranet (self-office) channels, intelligent video broadcast control, information distribution, and more, delivering high-definition video through a television or a computer.
To make the embodiments of the present application better understood, the video network is introduced below:
some of the techniques applied by the video network are as follows:
network technology (network technology)
The network technology innovation of the video network improves on traditional Ethernet to cope with the potentially enormous video traffic on the network. Unlike pure network packet switching or network circuit switching, the video networking technology uses packet switching to meet streaming requirements. It has the flexibility, simplicity, and low cost of packet switching together with the quality and security guarantees of circuit switching, achieving network-wide switched virtual circuits and seamless connection of data formats.
Switching Technology
The video network retains the two advantages of Ethernet, asynchrony and packet switching, while eliminating Ethernet's defects on the premise of full compatibility. It offers end-to-end seamless connection across the whole network, connects directly to user terminals, and directly carries IP data packets. User data requires no format conversion anywhere in the network. The video network is a higher-level form of Ethernet and a real-time exchange platform; it enables network-wide, large-scale, real-time transmission of high-definition video that the existing Internet cannot achieve, pushing numerous network video applications toward high definition and unification.
Server Technology
Server technology on the video network's unified video platform differs from that of traditional servers. Its streaming media transmission is connection-oriented, its data-processing capability is independent of traffic and communication time, and a single network layer can carry both signaling and data transmission. For voice and video services, streaming-media processing on the unified video platform is much simpler than general data processing, and its efficiency is improved more than a hundredfold over a traditional server.
Storage Technology
To handle media content of extremely large capacity and traffic, the ultra-high-speed storage technology of the unified video platform adopts an advanced real-time operating system. Program information in a server instruction is mapped to a specific hard-disk space, so media content no longer passes through the server but is sent instantly and directly to the user terminal, with a user waiting time of less than 0.2 second. Optimized sector allocation greatly reduces the mechanical seek movement of the hard-disk head; resource consumption is only 20% of an IP Internet system of the same grade, yet concurrent throughput is three times that of a traditional hard-disk array, improving overall efficiency by more than ten times.
Network Security Technology
The structural design of the video network eliminates the network security problems that trouble the Internet, through mechanisms such as independent permission control for each service and complete isolation of devices and user data. It generally needs no antivirus programs or firewalls, is immune to hacker and virus attacks, and provides users with a structurally worry-free secure network.
Service Innovation Technology
The unified video platform integrates services with transmission: whether for a single user, a private-network user, or an entire network, connection amounts to a single automatic link. User terminals, set-top boxes, or PCs connect directly to the unified video platform and obtain a rich variety of multimedia video services. The unified video platform uses a menu-style configuration table instead of traditional, complex application programming, so complex applications can be realized with very little code, enabling unlimited new service innovation.
Networking of the video network is as follows:
an internet of view is a centrally controlled network structure, which may be of the tree, star, ring, etc. type, but on this basis a centralized control node is required in the network to control the entire network.
As shown in fig. 1, the video network is divided into an access network and a metropolitan area network.
The devices of the access network part can be mainly classified into 3 types: node server, access switch, terminal (including various set-top boxes, coding boards, memories, etc.). The node server is connected to an access switch, which may be connected to a plurality of terminals and may be connected to an ethernet network.
The node server is a node which plays a centralized control function in the access network and can control the access switch and the terminal. The node server can be directly connected with the access switch or directly connected with the terminal.
Similarly, devices of the metropolitan network portion may also be classified into 3 types: a metropolitan area server, a node switch and a node server. The metro server is connected to a node switch, which may be connected to a plurality of node servers.
The node server here is the node server of the access network part; that is, the node server belongs to both the access network part and the metropolitan area network part.
The metropolitan area server is a node which plays a centralized control function in the metropolitan area network and can control a node switch and a node server. The metropolitan area server can be directly connected with the node switch and can also be directly connected with the node server.
Therefore, the whole video network is a network structure with layered centralized control, and the network controlled by the node server and the metropolitan area server can be in various structures such as tree, star and ring.
The access network part can form a unified video platform (the part in the dotted circle), and a plurality of unified video platforms can form a video network; each unified video platform may be interconnected via metropolitan area and wide area video networking.
1. Video networking device classification
1.1 The devices in the video network of the embodiment of the present application can be mainly classified into 3 types: servers, switches (including Ethernet protocol conversion gateways), and terminals (including various set-top boxes, coding boards, memories, etc.). The video network as a whole can be divided into a metropolitan area network (or national network, global network, etc.) and an access network.
1.2 The devices of the access network part can be mainly classified into 3 types: node servers, access switches (including Ethernet protocol conversion gateways), and terminals (including various set-top boxes, coding boards, memories, etc.).
The specific hardware structure of each access network device is as follows:
a node server:
As shown in fig. 2, the node server mainly includes a network interface module 201, a switching engine module 202, a CPU module 203, and a disk array module 204;
The packets coming from the network interface module 201, the CPU module 203, and the disk array module 204 all enter the switching engine module 202; the switching engine module 202 looks up the address table 205 for each incoming packet to obtain its steering information, and stores the packet in the queue of the corresponding packet buffer 206 based on that information; if the queue of the packet buffer 206 is nearly full, the packet is discarded. The switching engine module 202 polls all packet buffer queues and forwards when the following conditions are met: 1) the port send buffer is not full; 2) the queued packet counter is greater than zero. The disk array module 204 mainly implements control over the hard disk, including initialization, reading, and writing; the CPU module 203 is mainly responsible for protocol processing with the access switches and terminals (not shown in the figure), configuring the address table 205 (including the downlink protocol packet address table, the uplink protocol packet address table, and the data packet address table), and configuring the disk array module 204.
The access switch:
As shown in fig. 3, the access switch mainly includes network interface modules (a downlink network interface module 301 and an uplink network interface module 302), a switching engine module 303, and a CPU module 304;
Here, a packet (upstream data) coming from the downlink network interface module 301 enters the packet detection module 305; the packet detection module 305 checks whether the destination address (DA), source address (SA), packet type, and packet length of the packet meet the requirements; if so, it allocates a corresponding stream identifier (stream-id) and the packet enters the switching engine module 303, otherwise the packet is discarded. A packet (downstream data) coming from the uplink network interface module 302 enters the switching engine module 303, as does a packet coming from the CPU module 304. The switching engine module 303 looks up the address table 306 for each incoming packet to obtain its steering information. If a packet entering the switching engine module 303 goes from a downlink network interface to an uplink network interface, it is stored in the queue of the corresponding packet buffer 307 in association with its stream-id; if that queue is nearly full, the packet is discarded. If a packet entering the switching engine module 303 does not go from a downlink network interface to an uplink network interface, it is stored in the queue of the corresponding packet buffer 307 according to its steering information; if that queue is nearly full, it is discarded.
The switching engine module 303 polls all packet buffer queues, which may include two cases:
if the queue is from the downlink network interface to the uplink network interface, the following conditions are met for forwarding: 1) The port send buffer is not full; 2) The queued packet counter is greater than zero; 3) Obtaining a token generated by a code rate control module;
if the queue is not from the downlink network interface to the uplink network interface, the following conditions are met for forwarding: 1) The port send buffer is not full; 2) The queued packet counter is greater than zero.
The rate control module 308 is configured by the CPU module 304 and, at programmable intervals, generates tokens for all packet buffer queues going from downlink network interfaces to uplink network interfaces, to control the rate of upstream forwarding.
The CPU module 304 is mainly responsible for protocol processing with the node server, configuration of the address table 306, and configuration of the code rate control module 308.
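To make the forwarding discipline concrete, the following is a minimal sketch of the rate-controlled, downlink-to-uplink queue described above; all class and method names are assumptions for illustration, not structures defined in the patent. A packet is forwarded only when the port send buffer has room, the queued packet counter is greater than zero, and a rate-control token is available.

final class RateControlledQueue {
    private final java.util.ArrayDeque<byte[]> queue = new java.util.ArrayDeque<>();
    private final int capacity;   // "nearly full" threshold of the packet buffer
    private long tokens = 0;      // granted by the rate control module

    RateControlledQueue(int capacity) { this.capacity = capacity; }

    // Called by a timer at the CPU-programmed interval (the rate control module).
    synchronized void grantToken() { tokens++; }

    // Called by the switching engine when steering a packet into this queue.
    synchronized void enqueue(byte[] packet) {
        if (queue.size() >= capacity) return;   // queue nearly full: discard
        queue.addLast(packet);
    }

    // Polled by the switching engine; returns null unless all three
    // forwarding conditions hold.
    synchronized byte[] poll(boolean portSendBufferHasRoom) {
        if (!portSendBufferHasRoom || queue.isEmpty() || tokens == 0) return null;
        tokens--;
        return queue.pollFirst();
    }
}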
Ethernet protocol conversion gateway:
As shown in fig. 4, the apparatus mainly includes a network interface module (a downlink network interface module 401 and an uplink network interface module 402), a switch engine module 403, a CPU module 404, a packet detection module 405, a rate control module 408, an address table 406, a packet buffer 407, a MAC adding module 409, and a MAC deleting module 410.
Here, a data packet coming from the downlink network interface module 401 enters the packet detection module 405; the packet detection module 405 checks whether the Ethernet MAC DA, Ethernet MAC SA, Ethernet length or frame type, video network destination address DA, video network source address SA, video network packet type, and packet length of the packet meet the requirements; if so, a corresponding stream identifier (stream-id) is allocated, the MAC deletion module 410 strips the MAC DA, MAC SA, and length or frame type (2 bytes), and the packet enters the corresponding receiving buffer; otherwise, the packet is discarded;
the downlink network interface module 401 detects the sending buffer of the port, and if there is a packet, obtains the ethernet MAC DA of the corresponding terminal according to the destination address DA of the packet, adds the ethernet MAC DA of the terminal, the MAC SA of the ethernet protocol gateway, and the ethernet length or frame type, and sends the packet.
The other modules in the Ethernet protocol conversion gateway function similarly to those in the access switch.
A terminal:
the system mainly comprises a network interface module, a service processing module and a CPU module; for example, the set-top box mainly comprises a network interface module, a video and audio coding and decoding engine module and a CPU module; the coding board mainly comprises a network interface module, a video and audio coding engine module and a CPU module; the memory mainly comprises a network interface module, a CPU module and a disk array module.
1.3 The devices of the metropolitan area network part can be mainly classified into 3 types: node servers, node switches, and metropolitan area servers. The node switch mainly includes a network interface module, a switching engine module, and a CPU module; the metropolitan area server mainly includes a network interface module, a switching engine module, and a CPU module.
2. Video networking data packet definition
2.1 Access network packet definition
The data packet of the access network mainly includes the following parts: Destination Address (DA), Source Address (SA), reserved bytes, Payload (PDU), and CRC, laid out as shown below:

DA | SA | Reserved | Payload | CRC
wherein:
the Destination Address (DA) is composed of 8 bytes; the first byte represents the packet type (e.g., various protocol packets, multicast data packets, unicast data packets), with at most 256 possibilities; the second through sixth bytes are the metropolitan area network address; and the seventh and eighth bytes are the access network address;
the Source Address (SA) is also composed of 8 bytes (byte), and is defined to be the same as the Destination Address (DA);
the reserved byte consists of 2 bytes;
the payload has a different length depending on the datagram type: 64 bytes for the various protocol packets, 32+1024=1056 bytes for unicast data packets, and it is certainly not limited to these 2 types;
the CRC consists of 4 bytes and is calculated in accordance with the standard ethernet CRC algorithm.
2.2 metropolitan area network packet definition
The topology of a metropolitan area network is a graph, and there may be 2, or even more than 2, connections between two devices, i.e., between a node switch and a node server, between two node switches, or between two node servers. However, the address of each metro network device is unique; to accurately describe the connection relationships between metro devices, the embodiment of the present application introduces a parameter: a label, to uniquely describe a metropolitan area network device.
In this specification, the definition of a label is similar to that of an MPLS (Multi-Protocol Label Switching) label. Assuming there are two connections between device A and device B, a packet from device A to device B has 2 available labels, as does a packet from device B to device A. Labels are divided into incoming labels and outgoing labels: assuming the label of a packet entering device A (the incoming label) is 0x0000, the label of the packet when it leaves device A (the outgoing label) may become 0x0001. The network-access process of the metro network is centrally controlled, that is, both address allocation and label allocation are driven by the metropolitan area server, while the node switches and node servers execute passively. This differs from MPLS label allocation, which is the result of negotiation between switch and server.
As shown in the following table, the data packet of the metro network mainly includes the following parts:
DA | SA | Reserved | Label | Payload | CRC
That is: Destination Address (DA), Source Address (SA), reserved bytes (Reserved), Label, Payload (PDU), and CRC. The label format may be defined as follows: the label is 32 bits, with the upper 16 bits reserved and only the lower 16 bits used; it is located between the reserved bytes and the payload of the packet.
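A small companion sketch for the metro packet: per the description above, a 32-bit label sits between the reserved bytes and the payload, with the upper 16 bits reserved. The offset and names are assumptions derived from the stated layout.

final class MetroPacket {
    static final int LABEL_OFF = 18; // after 8-byte DA, 8-byte SA, 2 reserved bytes

    // Extracts the label; only the low 16 bits carry the label value.
    static int label(byte[] p) {
        int raw = ((p[LABEL_OFF] & 0xFF) << 24) | ((p[LABEL_OFF + 1] & 0xFF) << 16)
                | ((p[LABEL_OFF + 2] & 0xFF) << 8) | (p[LABEL_OFF + 3] & 0xFF);
        return raw & 0xFFFF;
    }
}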
Based on the characteristics of the video network, video conferences conducted over it are becoming more numerous and their application scenarios more diverse. Therefore, to guarantee conference quality in these different scenarios, the stability of the audio call quality in a video conference must be ensured.
For example, one application scenario is a video conference for commanding an unmanned aerial vehicle: a video networking terminal in the video network holds an audio/video call with a mobile terminal in the Internet, where an application service for this scenario is installed on the mobile terminal; the application provides local services that let the user control the UAV and conduct the audio/video call with the video networking terminal.
In such a scenario, to guarantee audio call quality for a mobile terminal that is often in the field, echo must be kept out of the audio the mobile terminal sends. The current approach is to connect the application service to a third-party echo suppression tool so that echo can be avoided by setting a time interval. However, integrating a third-party echo suppression tool requires packaging the third party's library, written in C, into a .so (dynamic link) library callable by the mobile terminal and invoking it through the JNI (Java Native Interface), which is very cumbersome. It increases developers' workload, the echo suppression effect is poor, and after a call runs for a while a buzzing sound appears, degrading call quality.
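To illustrate the integration burden just described (this is not code from the patent, and every name below is invented), a third-party C echo library would typically be compiled into a .so library and each entry point mirrored as a JNI native method:

public final class ThirdPartyAec {
    static { System.loadLibrary("aec"); } // loads the hand-wrapped libaec.so

    // Each native method must be matched by a C function with a mangled
    // JNI signature, which is the cumbersome part the text refers to.
    public static native long create(int sampleRate, int delayMs);
    public static native int process(long handle, short[] farEnd, short[] nearEnd, short[] out);
    public static native void destroy(long handle);
}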
Based on this, the applicant conceived one of the technical ideas of the present application: by jointly considering the characteristics of the video network and of the application object, improve audio call quality on the mobile terminal side in this scenario and at least solve the buzzing that appears after a call has run for a while. Specifically, when the application object detects that the audio/video call service is started, it triggers the preset audio acquisition mode on the mobile terminal, so that the mobile terminal collects the second audio data in that mode; it then performs echo cancellation on the second audio data according to the first audio data, removing from the second audio data the echo generated by playing the first audio data. This avoids both the cumbersome integration with a third-party echo suppression library and the degradation in call quality caused by echoes accumulating over time when echo is merely made indistinguishable to the human ear by a set time interval.
Referring to fig. 5, an application scenario of the video-network-based audio processing method is shown. In this scenario, a video networking terminal holds a video networking conference with an Internet terminal, and the Internet terminal communicates with an unmanned aerial vehicle through a remote-control handle. Specifically, the handle sends control instructions to the UAV to control its flight altitude and heading; the pictures and flight data captured by the UAV are returned to the Internet terminal, which can in turn relay them to the video networking terminal over the video network.
The video networking terminal is deployed in the video network and is communicatively connected to the streaming media server, which is communicatively connected to the mobile terminal. The mobile terminal, the UAV, and the remote-control handle are all deployed in the Internet, and the streaming media server can communicate both with the video networking terminal in the video network and with the mobile terminal in the Internet.
An audio playing component is configured on the mobile terminal; it may be, but is not limited to, an audio player such as an MP3 or MP4 player. The mobile terminal is also provided with an application object that corresponds to the streaming media server. The application object provides local services for the user, such as UAV data analysis, storage, transmission, and video networking audio/video conferencing, while the streaming media server provides background services, such as data forwarding, for the services performed by the application object.
In this embodiment of the application, the mobile terminal may be a terminal running the Android system, such as an Android mobile phone or an Android tablet computer.
Referring to fig. 6, a flowchart of the steps of a video-network-based audio processing method according to an embodiment of the present application is shown. The method may be applied to the application object installed on the mobile terminal and, as shown in fig. 6, may specifically include the following steps:
step S601, when detecting that a preset audio/video call service is opened, triggering a preset audio acquisition mode on the mobile terminal.
In this embodiment, the preset audio/video call service means that the audio/video call service is configured in the application object in advance, as one of the multiple services the application object provides. For example, the application object may provide services such as UAV data analysis, audio/video calls, and monitoring playback.
In a specific implementation, the application object may detect whether the audio/video call service is started based on the user's operation on it: if the user performs a start operation on the audio/video call service, the application object detects this operation and accordingly triggers the audio acquisition mode preset on the mobile terminal.
The preset audio acquisition mode refers to an audio acquisition mode, preset on the mobile terminal, that corresponds to the audio/video call service of the application object. The audio acquisition mode controls how the mobile terminal collects audio data and preprocesses it. In practice, when the audio/video call service is started, the application object triggers the audio acquisition mode so that the mobile terminal collects audio data in that mode. "Triggering" in this embodiment may mean starting, i.e., enabling the preset audio acquisition mode.
In this embodiment, the audio acquisition mode whose captured audio contains the least environmental noise, as determined from audio data collected by the mobile terminal in each of its original audio acquisition modes, may be chosen as the preset audio acquisition mode.
For example, taking an Android mobile phone as the mobile terminal, the device provides 9 original audio acquisition modes (audio sources), respectively:
AudioSource, defiult DEFAULT mode;
mic microphone mode;
audio, voice _ UPLINK telephone UPLINK mode;
VOICE _ DOWNLINK telephone down mode;
audio. Voice _ CALL phone up + down mode;
audiosource. Camport camera mode;
voice RECOGNITION mode of audio source, voice RECOGNITION;
voice COMMUNICATION mode, such as VoIP (Voice over Internet Protocol) mode;
remote _ subfix remote voice mode, such as wifi display mode.
Actual tests determined that the audio data collected by the mobile terminal in the voice communication mode contains the least noise, so the voice communication mode can be chosen as the preset audio acquisition mode; thus, when the application object installed on the Android phone detects that the audio/video call service has started, it triggers the voice communication mode.
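On Android this selection corresponds to the MediaRecorder.AudioSource.VOICE_COMMUNICATION constant. A minimal capture sketch follows; the 8 kHz mono 16-bit PCM parameters and the class name are assumptions for illustration, not values from the patent.

import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

final class VoiceCapture {
    static AudioRecord open() {
        int rate = 8000; // assumed narrowband voice rate
        int minBuf = AudioRecord.getMinBufferSize(rate,
                AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
        AudioRecord rec = new AudioRecord(
                MediaRecorder.AudioSource.VOICE_COMMUNICATION, // the preset acquisition mode
                rate, AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT, minBuf * 2);
        rec.startRecording();
        return rec; // callers then read PCM frames with rec.read(...)
    }
}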
Step S602, receiving first audio data sent by the streaming media server in the audio/video call service, and calling the audio playing component to play the first audio data.
And the first audio data is sent to the streaming media server by the video networking terminal.
In this embodiment, the first audio data is collected by the video networking terminal, sent by it over the video network to the streaming media server, and then forwarded by the streaming media server to the application object on the mobile terminal.
In a specific implementation, the mobile terminal can decode and play the first audio data by calling the audio playing component, and can cache the first audio data at the same time, so that echo cancellation can later be performed, according to the cached first audio data, on the audio data newly collected by the mobile terminal.
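The play-and-cache step might look as follows, using the standard Android AudioTrack API; the sample-rate choice, class name, and buffering scheme are assumptions, not details given by the patent.

import android.media.AudioFormat;
import android.media.AudioManager;
import android.media.AudioTrack;
import java.util.ArrayDeque;

final class FarEndPlayer {
    private final ArrayDeque<short[]> reference = new ArrayDeque<>(); // cached first audio data
    private final AudioTrack track = new AudioTrack(
            AudioManager.STREAM_VOICE_CALL, 8000,
            AudioFormat.CHANNEL_OUT_MONO, AudioFormat.ENCODING_PCM_16BIT,
            AudioTrack.getMinBufferSize(8000, AudioFormat.CHANNEL_OUT_MONO,
                    AudioFormat.ENCODING_PCM_16BIT),
            AudioTrack.MODE_STREAM);

    FarEndPlayer() { track.play(); }

    // Play a decoded frame of the first audio data and keep a copy as the
    // far-end reference for the later echo cancellation step.
    void playAndCache(short[] pcm) {
        track.write(pcm, 0, pcm.length);
        reference.addLast(pcm.clone());
    }

    short[] nextReferenceFrame() { return reference.pollFirst(); }
}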
Step S603, acquiring second audio data acquired by the mobile terminal in the audio acquisition mode.
In practice, once the first audio data is played, the application object starts to obtain the second audio data collected by the mobile terminal in the audio acquisition mode. While the first audio data is being played, the sound emitted by the playing component is reflected by the surrounding environment and forms an echo; because the echo travels a longer path than the directly transmitted sound, the mobile terminal picks up the echo of the first audio data together with the user's speech after the first audio data is played. The second audio data therefore contains echo audio data: the played first audio data as reflected back from the environment.
In this embodiment, because audio collected in the preset audio acquisition mode contains the least environmental noise, the second audio data collected in that mode also contains little environmental noise, which improves its audio quality. Here, environmental noise means noise produced in the surroundings of the mobile terminal, and echo can be regarded as one kind of environmental noise; when the second audio data collected in the preset acquisition mode contains less environmental noise, it also exhibits less echo. Hence the quality of the collected second audio data is improved and a better echo-cancellation result can be obtained.
Step S604, performing echo cancellation processing on the second audio data according to the first audio data and the second audio data, to obtain target audio data after echo cancellation processing.
In this embodiment, when the second audio data is obtained, the echo audio data included in it may be eliminated according to the cached first audio data, yielding target audio data from which the echo has been removed. The echo audio data is the audio, collected by the mobile terminal, that the environment produces by reflecting the first audio data being played.
Thus, the obtained target audio data does not include echo audio data, thereby realizing echo cancellation processing at the mobile terminal side in the audio and video call service.
Step S605, sending the target audio data to the streaming media server, where the streaming media server is configured to send the target audio data to the video network terminal.
After obtaining the target audio data, the application object may send it to the streaming media server, which forwards it over the video network to the video networking terminal. Because the target audio data contains no echo audio data, when the video networking terminal plays it, the mobile-terminal user's voice is clear and the video-networking user does not hear an echo of what they said earlier, improving call quality.
In the embodiment of the present application, because the second audio data collected in the preset audio acquisition mode contains less environmental noise, it also exhibits less echo, so the quality of the collected second audio data is improved. And because echo cancellation is performed on the second audio data according to the first audio data, the resulting target audio data contains no echo audio data, improving the audio quality of what is sent. Compared with the time-interval approach, the target audio data sent here carries no echo audio data, so call quality is guaranteed and the call is clear; this avoids both echo that remains audible due to an unreasonable interval setting and the buzzing caused by echoes accumulating in the transmitted audio over time.
With reference to the foregoing embodiment, in an optional example, if a first microphone and a second microphone are configured on the mobile terminal, then in step S601, triggering the audio acquisition mode preset on the mobile terminal includes the following steps:
step S6011, the first microphone and the second microphone are called.
In practice, the first microphone and the second microphone may be disposed at different locations on the mobile terminal; optionally, the first microphone at the bottom and the second microphone at the top. As described for the audio acquisition mode in step S601, multiple original audio acquisition modes may be set in the mobile terminal; in practice, the microphones each mode calls are not the same, and the ways the captured audio data is preprocessed may also differ.
In an embodiment of the application, the preset audio acquisition mode may correspond to calling the first microphone and the second microphone simultaneously. That is, when the preset audio acquisition mode is triggered, the application object may call the first microphone and the second microphone at the same time, so as to collect audio data with both.
Correspondingly, step S603 may specifically include the following steps:
step S6031, first microphone audio data collected by the first microphone and second microphone audio data collected by the second microphone are acquired.
In practice, after the first microphone and the second microphone are called, the first microphone and the second microphone may simultaneously acquire audio data.
Because the first microphone and the second microphone are located at different positions on the mobile terminal, the audio data they collect differs. Specifically, since the second microphone at the top and the first microphone at the bottom are at different distances from the user during a call, the volume of the user's voice differs between the first and second microphone audio data, while the background noise picked up by the two microphones is essentially the same; this difference can be exploited to filter out the noise while retaining the human voice.
Step S6032, performing noise reduction processing on the second microphone audio data according to the first microphone audio data and the second microphone audio data to obtain second audio data.
In this embodiment of the application, in the preset audio acquisition mode, noise reduction may be performed on the second microphone audio data. Specifically, the first and second microphone audio data may be decoded to generate a compensation signal, and the second microphone audio data is then noise-reduced according to that signal, removing the environmental noise from it and yielding the noise-reduced second audio data.
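The patent does not specify how the compensation signal is computed, so the following time-domain sketch is a deliberately simplified illustration of the idea only: both microphones pick up roughly the same background noise while the speech levels differ, so subtracting a scaled copy of the reference channel attenuates the shared noise component. Real implementations typically operate per frequency band.

final class DualMicDenoiser {
    private final double alpha; // assumed noise-coupling factor between the two mics

    DualMicDenoiser(double alpha) { this.alpha = alpha; }

    // primaryMic: the channel to clean; referenceMic: the other microphone.
    short[] denoise(short[] primaryMic, short[] referenceMic) {
        short[] out = new short[primaryMic.length];
        for (int i = 0; i < out.length; i++) {
            int v = primaryMic[i] - (int) Math.round(alpha * referenceMic[i]);
            out[i] = (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, v));
        }
        return out;
    }
}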
With reference to the foregoing embodiment, in an optional example, step S604 may specifically include the following steps:
Step S6041, determining third audio data corresponding to the first audio data in the second audio data.
In this optional example, when echo cancellation processing is performed on the second audio data, the echo audio data included in the second audio data is the echo reflected from playing the first audio data, so the echo audio data is audio data related to the first audio data. For example, the echo audio data and the first audio data both originate from the same user's voice; although the echo has been reflected, their acoustic features are the same. Therefore, based on voice recognition technology, third audio data whose acoustic features match those of the first audio data to a degree greater than a preset matching degree can be found in the second audio data, and the third audio data so determined is the echo audio data reflected from playing the first audio data. The acoustic feature may be a frequency feature or a spectral feature of the audio data.
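The matching computation itself is left open by the embodiment. As a simple stand-in, the matching degree between two acoustic-feature frames could be taken as their normalized cross-correlation, sketched below; the comparison threshold would play the role of the preset matching degree:

```java
final class FeatureMatch {
    /** Normalized cross-correlation in [-1, 1] between two equal-length
     *  feature frames, used here as a stand-in "matching degree". */
    static double matchingDegree(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        if (na == 0 || nb == 0) return 0; // silent frame: no match
        return dot / Math.sqrt(na * nb);
    }
}
```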
Step S6042, filtering the third audio data from the second audio data to obtain target audio data with the third audio data filtered.
In practice, after the third audio data is determined, the third audio data may be removed from the second audio data to obtain the target audio data.
Accordingly, in an optional example, while triggering the audio capturing mode preset on the mobile terminal, the method may further specifically include the following step:
step S6012, an adaptive filter set in the mobile terminal is called.
In this optional example, in order to improve the efficiency of the echo cancellation processing performed on the second audio data and to reduce the amount of low-level development required of the application object, the application object may call an adaptive filter configured in the mobile terminal, so that the adaptive filter performs the echo cancellation processing on the second audio data.
Optionally, step S6012 may specifically include the following steps:
step S6012-1, determining at least one application program interface adapted to the application program object on the video network terminal, and determining whether a target interface exists in the at least one application program interface.
In practice, different adaptive filters can be built on different underlying development software; the efficiency and quality of the echo cancellation performed by different adaptive filters may differ, and different adaptive filters correspond to different application program interfaces.
In this embodiment of the present application, a calling relationship may be established between the application program interfaces corresponding to the different adaptive filters and the application program object. Determining the application program interface adapted to the application program object on the video network terminal then amounts to determining at least one application program interface that has a calling relationship with the application program object on the video network terminal. Therefore, when the application program object performs echo cancellation processing on the second audio data, it can successfully call the adaptive filter configured on the mobile terminal by calling an application program interface that has established a calling relationship with it; compared with calling a third-party echo suppression tool, this simplifies the calling process and improves efficiency.
The target interface may be preset. Specifically, each application program interface has its own identifier, and in practice, determining whether a target interface exists in the at least one application program interface may mean determining whether the at least one application program interface contains an interface whose identifier is consistent with the identifier of the target interface.
In an alternative example, since the efficiency and quality of the echo cancellation performed by different adaptive filters differ, each of the at least one application program interface may in practice have a priority; the higher the priority, the better the efficiency and quality of the echo cancellation performed by the adaptive filter corresponding to that interface. In this embodiment, the target interface may refer to an application program interface whose priority is higher than a preset priority; that is, it may be determined whether the at least one application program interface contains a target interface with a priority higher than the preset priority.
In a specific implementation, taking the mobile terminal as an Android phone as an example, the target interface may be an AEC (acoustic echo canceller) interface. Because an echo cancellation procedure can be developed very quickly on top of the AEC, an adaptive filter developed based on the AEC can perform echo cancellation on the audio data quickly and with good quality, so the audio call quality of the application can be improved.
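On Android, the platform interface playing this role is android.media.audiofx.AcousticEchoCanceler. A minimal sketch of checking for it and attaching it to a capture session follows; whether isAvailable() returns true is device dependent:

```java
import android.media.AudioRecord;
import android.media.audiofx.AcousticEchoCanceler;

final class AecHelper {
    /** Attaches the platform echo canceller to a capture session when the
     *  device exposes one; returns null when the AEC interface is absent. */
    static AcousticEchoCanceler tryAttachAec(AudioRecord record) {
        if (!AcousticEchoCanceler.isAvailable()) {
            return null; // no target interface on this device (see step S6012-3)
        }
        AcousticEchoCanceler aec =
                AcousticEchoCanceler.create(record.getAudioSessionId());
        if (aec != null) {
            aec.setEnabled(true);
        }
        return aec;
    }
}
```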
If it is determined that the target interface exists, step S6012-2 is performed; if it is determined that the target interface does not exist, step S6012-3 is performed.
Step S6012-2, calling the adaptive filter corresponding to the target interface through the target interface.
In practice, when the target interface exists in the at least one application program interface, the adaptive filter corresponding to the interface may be called through the target interface.
For example, if there is an AEC interface, the adaptive filter corresponding to the AEC interface is called through the AEC interface.
Step S6012-3, the adaptive filter corresponding to the preset application program interface is called through the preset application program interface.
In this embodiment, the preset application program interface may refer to a standby application program interface among the at least one application program interface. In practice, one of the application program interfaces with which the application program object has established a calling relationship may be designated as the standby application program interface; when it is determined that the target interface does not exist in the at least one application program interface, the application program object may immediately call the standby application program interface.
For example, taking the mobile terminal as an Android phone and the target interface as the AEC interface, in practice the AEC interface is not necessarily applicable to every model of mobile terminal. In this case, a speex (echo cancellation algorithm) interface may be used as the preset application program interface, and the adaptive filter may be called through the speex interface. Speex is applicable to many types of mobile terminals and has a wide adaptation range, so the speex interface can serve as the standby application program interface.
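Putting steps S6012-1 to S6012-3 together, the interface-selection logic might be sketched as follows. SpeexEchoCanceller here is a hypothetical JNI wrapper around Speex's echo-cancellation module (speex_echo_*), shown only as a stub; the embodiment does not name such a class, and the frame size and filter length are assumed values:

```java
import android.media.AudioRecord;
import android.media.audiofx.AcousticEchoCanceler;

/** Hypothetical common shape for both cancellers (illustration only). */
interface EchoCanceller { short[] process(short[] nearEnd, short[] farEnd); }

/** Hypothetical JNI stub around Speex's speex_echo_* module. */
final class SpeexEchoCanceller implements EchoCanceller {
    SpeexEchoCanceller(int frameSize, int filterLength) { /* native init */ }
    public short[] process(short[] nearEnd, short[] farEnd) {
        return nearEnd; // would call speex_echo_cancellation(...) via JNI
    }
}

final class EchoCancellerSelector {
    /** Mirrors steps S6012-1 to S6012-3: prefer the target (AEC) interface,
     *  otherwise call the standby (speex) interface. */
    static EchoCanceller select(AudioRecord record) {
        if (AcousticEchoCanceler.isAvailable()) {                 // S6012-1
            AcousticEchoCanceler aec =
                    AcousticEchoCanceler.create(record.getAudioSessionId());
            if (aec != null) {
                aec.setEnabled(true);                             // S6012-2
                // The platform AEC filters inside the capture path, so the
                // application-level processing becomes a pass-through.
                return (nearEnd, farEnd) -> nearEnd;
            }
        }
        return new SpeexEchoCanceller(160, 1600);                 // S6012-3
    }
}
```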
Accordingly, in an alternative example, since each application program interface may have its own priority, the priority of the preset application program interface may be set to the level immediately below the preset priority.
After the adaptive filter is successfully called, the second audio data may be subjected to echo cancellation processing by the adaptive filter. Specifically, step S6041 and step S6042 may each be performed by the adaptive filter, where step S6041 may specifically include the following steps:
Step S60411, the first audio data is input to the adaptive filter, and output audio data output by the adaptive filter is obtained.
When the first audio data is received, it is buffered, so the first audio data can be extracted from the buffer and input into the adaptive filter; the output audio data is obtained after the adaptive filter processes the first audio data.
Specifically, taking the adaptive filter corresponding to the AEC interface as an example, assume the first audio data is x(n). When x(n) is input into the adaptive filter, the filter updates and adjusts its weighting coefficients according to a specific algorithm for each sample of the input signal sequence x(n), so that the mean square error between the output signal sequence y(n) and the expected output signal sequence d(n) is minimized; that is, y(n) approximates the expected sequence d(n), and the closer y(n) is to d(n), the more closely y(n) corresponds to x(n). The output signal sequence y(n) is the output audio data, and the coefficients of an adaptive filter designed under the minimum mean-square-error criterion can be solved from the Wiener-Hopf equation.
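As a concrete illustration of this adaptation rule, a minimal normalized-LMS (NLMS) filter, one common realization of minimum mean-square-error adaptation, might look as follows; the filter length and step size are assumed parameters:

```java
/** Minimal normalized-LMS (NLMS) adaptive filter: for each sample of the
 *  reference x(n) it produces y(n) = w^T x and nudges the weights w toward
 *  minimizing the mean square error e(n) = d(n) - y(n). */
final class NlmsFilter {
    private final double[] w;   // adaptive weighting coefficients
    private final double[] x;   // sliding window of recent reference samples
    private final double mu;    // step size (assumed, typically 0 < mu < 2)

    NlmsFilter(int length, double stepSize) {
        this.w = new double[length];
        this.x = new double[length];
        this.mu = stepSize;
    }

    /** Processes one sample pair; returns e(n) = d(n) - y(n), i.e. the
     *  desired signal with the current echo estimate removed. */
    double process(double reference, double desired) {
        // shift the reference window and insert the newest sample
        System.arraycopy(x, 0, x, 1, x.length - 1);
        x[0] = reference;

        double y = 0, energy = 1e-9; // small constant avoids division by zero
        for (int i = 0; i < w.length; i++) {
            y += w[i] * x[i];
            energy += x[i] * x[i];
        }
        double e = desired - y;
        // NLMS weight update: w <- w + mu * e * x / ||x||^2
        double g = mu * e / energy;
        for (int i = 0; i < w.length; i++) {
            w[i] += g * x[i];
        }
        return e;
    }
}
```

Feeding the buffered first audio data in as the reference and the second audio data as the desired signal then makes y(n) the echo estimate and e(n) the echo-free signal, which corresponds to the filtering described in steps S60411 through S6043.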
Step S60412, among the second audio data, third audio data having the same frequency as the output audio data is determined.
The application object may use the adaptive filter to determine, among the second audio data, third audio data having the same frequency as the output audio data. Since the output audio data corresponds to x(n), third audio data having the same frequency as the output audio data indicates that the third audio data is audio data related to the first audio data.
Accordingly, step S6042 may specifically be the following step:
step S6043, by using the adaptive filter, filtering the third audio data from the second audio data, and obtaining target audio data with the third audio data filtered.
In practice, when the third audio data is determined, the third audio data may be filtered from the second audio data by using an adaptive filter.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiments. Further, those of skill in the art will recognize that the embodiments described in this specification are presently preferred embodiments and that no particular act is required to implement the embodiments of the disclosure.
Referring to fig. 7, a block diagram of a video network-based audio processing apparatus according to an embodiment of the present application is shown, where a video network terminal is deployed in the video network, the video network terminal is communicatively connected to a streaming media server, the streaming media server is communicatively connected to a mobile terminal, and an audio playing component is configured on the mobile terminal, the apparatus is applied to an application object set in the mobile terminal, and the apparatus may specifically be a virtual apparatus, and specifically may include the following modules:
the audio mode triggering module 701 is configured to trigger an audio data acquisition mode preset on the mobile terminal when detecting that a preset audio call service is started;
an audio data receiving and playing module 702, configured to receive first audio data sent by the streaming media server, call the audio playing component, and play the first audio data; the first audio data is sent to the streaming media server by the video networking terminal;
an audio data acquisition module 703, configured to acquire second audio data acquired by the mobile terminal in the audio data acquisition mode;
an audio data processing module 704, configured to perform echo cancellation processing on the second audio data according to the first audio data and the second audio data, so as to obtain target audio data after echo cancellation processing;
an audio data sending module 705, configured to send the target audio data to the streaming media server, where the streaming media server is configured to send the target audio data to the video networking terminal.
Optionally, a first microphone and a second microphone are configured on the mobile terminal, and the audio mode triggering module may be specifically configured to invoke the first microphone and the second microphone;
the audio data acquisition module may specifically include the following units:
a microphone audio data acquisition unit configured to acquire first microphone audio data acquired by the first microphone and second microphone audio data acquired by the second microphone;
and the noise reduction processing unit is used for carrying out noise reduction processing on the second microphone audio data according to the first microphone audio data and the second microphone audio data to obtain second audio data.
Optionally, the audio data processing module may specifically include the following units:
the audio data searching unit is used for determining third audio data corresponding to the first audio data in the second audio data;
and the audio data filtering unit is used for filtering the third audio data from the second audio data to obtain target audio data with the third audio data filtered.
Optionally, the apparatus may further specifically include the following modules:
the calling module is used for calling the self-adaptive filter arranged in the mobile terminal;
the audio data searching unit may specifically include the following units:
an audio data input unit, configured to input the first audio data into the adaptive filter, so as to obtain output audio data output by the adaptive filter;
an audio data determination unit configured to determine, among the second audio data, third audio data having the same frequency as the output audio data.
Optionally, the calling module may specifically include the following units:
the target interface determining unit is used for determining at least one application program interface which is matched with the application program object on the video network terminal and determining whether a target interface exists in the at least one application program interface;
a first calling unit, configured to call the adaptive filter through the target interface when the target interface exists in the at least one application program interface;
and the second calling unit is used for calling the adaptive filter through a preset application program interface when the target interface does not exist in the at least one application program interface.
For the embodiment of the audio processing device based on the video network, since it is basically similar to the embodiment of the audio processing method based on the video network, the description is simple, and the relevant points can be referred to the partial description of the embodiment of the audio processing method based on the video network.
An embodiment of the present application further provides an electronic device, including:
one or more processors; and one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the device to perform one or more video networking-based audio processing methods as described in embodiments of the present application.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program that causes a processor to execute the method for processing audio based on video networking according to the embodiments of the present application.
The embodiments in the present specification are all described in a progressive manner, and each embodiment focuses on differences from other embodiments, and portions that are the same and similar between the embodiments may be referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in a process, method, article, or terminal device that comprises the element.
The foregoing describes in detail an audio processing method and apparatus, an electronic device, and a storage medium based on video networking provided by the present application. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the foregoing embodiments is only intended to help understand the method and its core ideas. Meanwhile, for those skilled in the art, the specific implementation and application scope may vary according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. An audio processing method based on a video network is characterized in that a video network terminal is deployed in the video network, the video network terminal is in communication connection with a streaming media server, the streaming media server is in communication connection with a mobile terminal, and an audio playing component is configured on the mobile terminal, and the method is applied to an application program object arranged in the mobile terminal, and comprises the following steps:
when detecting that a preset audio and video call service is started, triggering a preset audio acquisition mode on the mobile terminal, wherein the preset audio acquisition mode is an audio acquisition mode corresponding to audio data with the least environmental noise;
receiving first audio data sent by the streaming media server in the audio and video call service, and calling the audio playing component to play the first audio data; the first audio data is sent to the streaming media server by the video networking terminal;
acquiring second audio data acquired by the mobile terminal in the audio acquisition mode;
according to the first audio data and the second audio data, performing echo cancellation processing on the second audio data to obtain target audio data after echo cancellation processing;
and sending the target audio data to the streaming media server, wherein the streaming media server is used for sending the target audio data to the video networking terminal.
2. The method according to claim 1, wherein a first microphone and a second microphone are configured on the mobile terminal, and triggering a preset audio capture mode on the mobile terminal comprises:
invoking the first microphone and the second microphone;
acquiring second audio data acquired by the mobile terminal in the audio acquisition mode, including:
obtaining first microphone audio data captured by the first microphone and second microphone audio data captured by the second microphone;
and according to the first microphone audio data and the second microphone audio data, carrying out noise reduction processing on the second microphone audio data to obtain second audio data.
3. The method of claim 1, wherein performing echo cancellation processing on the second audio data according to the first audio data and the second audio data to obtain target audio data after echo cancellation processing, comprises:
determining third audio data corresponding to the first audio data in the second audio data;
and filtering the third audio data from the second audio data to obtain target audio data with the third audio data filtered.
4. The method according to claim 3, wherein while triggering a preset audio capture mode on the mobile terminal, the method further comprises:
calling a self-adaptive filter arranged in the mobile terminal;
determining, among the second audio data, third audio data having the same frequency as the first audio data, including:
inputting the first audio data into the adaptive filter to obtain output audio data output by the adaptive filter;
in the second audio data, third audio data having the same frequency as the output audio data is determined.
5. The method of claim 4, wherein invoking the adaptive filter configured on the mobile terminal comprises:
determining at least one application program interface which is matched with the application program object on the video network terminal, and determining whether a target interface exists in the at least one application program interface;
when the target interface exists in the at least one application program interface, calling an adaptive filter corresponding to the target interface through the target interface;
and when the target interface does not exist in the at least one application program interface, calling the adaptive filter corresponding to the preset application program interface through a preset application program interface.
6. An audio processing device based on video networking, wherein a video networking terminal is deployed in the video networking, the video networking terminal is in communication connection with a streaming media server, the streaming media server is in communication connection with a mobile terminal, an audio playing component is configured on the mobile terminal, and the device is applied to an application program object arranged in the mobile terminal, and comprises:
the mobile terminal comprises an audio mode triggering module, a voice acquisition module and a voice processing module, wherein the audio mode triggering module is used for triggering a preset audio acquisition mode on the mobile terminal when detecting that a preset audio communication service is started, and the preset audio acquisition mode is an audio acquisition mode corresponding to audio data with the least environmental noise;
the audio data receiving and playing module is used for receiving first audio data sent by the streaming media server, calling the audio playing component and playing the first audio data; the first audio data is sent to the streaming media server by the video networking terminal;
the audio data acquisition module is used for acquiring second audio data acquired by the mobile terminal in the audio acquisition mode;
the audio data processing module is used for performing echo cancellation processing on the second audio data according to the first audio data and the second audio data to obtain target audio data after echo cancellation processing;
and the audio data sending module is used for sending the target audio data to the streaming media server, and the streaming media server is used for sending the target audio data to the video network terminal.
7. The apparatus according to claim 6, wherein a first microphone and a second microphone are configured on the mobile terminal, and the audio mode triggering module is specifically configured to invoke the first microphone and the second microphone;
the audio data acquisition module comprises:
a microphone audio data acquisition unit configured to acquire first microphone audio data acquired by the first microphone and second microphone audio data acquired by the second microphone;
and the noise reduction processing unit is used for carrying out noise reduction processing on the second microphone audio data according to the first microphone audio data and the second microphone audio data to obtain second audio data.
8. The apparatus of claim 6, wherein the audio data processing module comprises:
the audio data searching unit is used for determining third audio data corresponding to the first audio data in the second audio data;
and the audio data filtering unit is used for filtering the third audio data from the second audio data to obtain target audio data with the third audio data filtered.
9. An electronic device, comprising:
one or more processors; and
one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the electronic device to perform the video networking-based audio processing method of any of claims 1-5.
10. A computer-readable storage medium storing a computer program for causing a processor to execute the video network-based audio processing method according to any one of claims 1 to 5.
CN201911285285.7A 2019-12-13 2019-12-13 Audio processing method and device based on video network, electronic equipment and storage medium Active CN111212032B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911285285.7A CN111212032B (en) 2019-12-13 2019-12-13 Audio processing method and device based on video network, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111212032A CN111212032A (en) 2020-05-29
CN111212032B true CN111212032B (en) 2022-12-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant