Detailed Description
To make the aforementioned objects, features and advantages of the present invention more comprehensible, embodiments of the invention are described in further detail below with reference to the accompanying figures.
Referring to Fig. 1, a flowchart illustrating the steps of a first embodiment of an echo cancellation method according to the present invention is shown; the method may specifically include the following steps:
Step 101, a first video conference terminal determines a filter coefficient;
It should be noted that the method can be applied to video networking. Video networking is an important milestone in network development: it is a network system that can realize high-definition video transmission and push many internet applications toward high-definition video and high-definition face-to-face interaction.
Video networking adopts real-time high-definition video switching technology and can integrate dozens of required services, such as video, voice, pictures, text, communication and data, on one network platform, for example high-definition video conferencing, video monitoring, intelligent monitoring analysis, emergency command, digital broadcast television, time-shifted television, network teaching, live broadcast, video on demand (VOD), television mail, Personal Video Recorder (PVR), intranet (self-office) channels, intelligent video broadcast control and information distribution, and realizes high-definition video playback through a television or a computer.
In order to make the embodiments of the present invention better understood by those skilled in the art, a description of the video network is first provided below.
Some of the technologies applied in the video networking are as follows:
Network Technology (Network Technology): the network technology innovation of video networking improves on traditional Ethernet to handle the potentially enormous video traffic on the network. Unlike pure network Packet Switching or pure network Circuit Switching, video networking technology adopts Packet Switching to satisfy streaming requirements. Video networking technology has the flexibility, simplicity and low cost of packet switching while offering the quality and security guarantees of circuit switching, thereby realizing seamless whole-network switched virtual circuits together with a unified data format.
Switching Technology (Switching Technology): video networking adopts the two advantages of Ethernet, namely asynchronism and packet switching, while eliminating Ethernet's defects on the premise of full compatibility. It provides end-to-end seamless connection over the whole network, communicates directly with user terminals and directly carries IP data packets. User data requires no format conversion anywhere across the network. Video networking is a higher-level form of Ethernet; it is a real-time switching platform that can realize whole-network, large-scale, real-time transmission of high-definition video, which the existing internet cannot, and pushes many network video applications toward high definition and unification.
Server Technology (Server Technology): server technology on the video networking unified video platform differs from traditional server technology. Its streaming media transmission is built on a connection-oriented basis, its data processing capacity is independent of traffic and communication time, and a single network layer can carry both signaling and data transmission. For voice and video services, streaming media processing on the video networking unified video platform is much simpler than general data processing, and efficiency is improved more than a hundredfold over a traditional server.
Storage Technology (Storage Technology): the ultra-high-speed storage technology of the unified video platform adopts the most advanced real-time operating system in order to handle media content of very large capacity and very large throughput. Program information in a server instruction is mapped to specific hard-disk space, and the media content no longer passes through the server but is sent directly and instantly to the user terminal, with a typical user waiting time of less than 0.2 seconds. Optimized sector distribution greatly reduces the mechanical seek movement of the hard-disk head; resource consumption is only 20% of that of an IP internet system of the same grade, while the concurrent throughput generated is three times that of a traditional hard-disk array, and overall efficiency is improved by more than ten times.
Network Security Technology (Network Security Technology): the structural design of video networking completely eliminates, at the structural level, the network security problems that trouble the internet, through measures such as independent permission control for each service and complete isolation of devices and user data. It generally needs no antivirus programs or firewalls, avoids attacks by hackers and viruses, and provides users with a structurally worry-free secure network.
Service Innovation Technology (Service Innovation Technology): the unified video platform integrates services with transmission; whether for a single user, a private-network user or a network aggregate, only one automatic connection is needed. A user terminal, set-top box or PC connects directly to the unified video platform to obtain a rich variety of multimedia video services. The unified video platform adopts a menu-style configuration table instead of traditional complex application programming, so complex applications can be realized with very little code, enabling effectively unlimited new service innovation.
Networking of the video network is as follows:
The video network has a centrally controlled network structure; the network may be a tree network, a star network, a ring network or the like, but in every case the whole network is controlled by a centralized control node within the network.
Fig. 2 is a schematic diagram of a video network according to the present invention. As can be seen from Fig. 2, the video network is divided into two parts: an access network and a metropolitan area network.
The devices of the access network part can be mainly classified into 3 types: node server, access switch, terminal (including various set-top boxes, coding boards, memories, etc.). The node server is connected to an access switch, which may be connected to a plurality of terminals and may be connected to an ethernet network.
The node server is a node which plays a centralized control function in the access network and can control the access switch and the terminal. The node server can be directly connected with the access switch or directly connected with the terminal.
Fig. 3 is a schematic diagram of a hardware structure of a node server according to the present invention. The node server mainly comprises a network interface module 301, a switching engine module 302, a CPU module 303 and a disk array module 304;
The network interface module 301, the CPU module 303 and the disk array module 304 all feed into the switching engine module 302. The switching engine module 302 performs an address table 305 lookup on each incoming packet to obtain its steering information, and stores the packet in the queue of the corresponding packet buffer 306 according to that steering information; if the queue of the packet buffer 306 is nearly full, the packet is discarded. The switching engine module 302 polls all packet buffer queues and forwards a packet if the following conditions are met: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero. The disk array module 304 mainly implements control over the hard disk, including initialization, read and write, and other operations; the CPU module 303 is mainly responsible for protocol processing with the access switch and the terminal (not shown in the figure), for configuring the address table 305 (including a downlink protocol packet address table, an uplink protocol packet address table and a data packet address table), and for configuring the disk array module 304.
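As an illustration of the forwarding behaviour just described, the following Python sketch models a switching engine that steers packets by an address-table lookup, drops packets when a queue is nearly full, and forwards only when the port send buffer is not full and the queue packet counter is greater than zero. The queue and buffer limits are illustrative assumptions, not values taken from this description.

```python
# Minimal sketch (not the patent's implementation) of the switching-engine
# behaviour: steer by address-table lookup, queue per direction, drop when a
# queue is nearly full, forward only when the output buffer has room and the
# queue is non-empty.
from collections import deque

QUEUE_LIMIT = 256          # assumed "nearly full" threshold
PORT_BUFFER_LIMIT = 64     # assumed per-port send-buffer capacity

class SwitchingEngine:
    def __init__(self, address_table):
        self.address_table = address_table   # destination address -> output port
        self.queues = {}                     # port -> deque of packets
        self.port_send_buffers = {}          # port -> list of packets awaiting transmission

    def receive(self, packet):
        port = self.address_table.get(packet["da"])
        if port is None:
            return                           # unknown destination: nothing to do
        queue = self.queues.setdefault(port, deque())
        if len(queue) >= QUEUE_LIMIT:        # queue nearly full: discard the packet
            return
        queue.append(packet)

    def poll(self):
        # poll all packet buffer queues and forward while both conditions hold:
        # 1) the port send buffer is not full; 2) the queue packet counter > 0
        for port, queue in self.queues.items():
            send_buf = self.port_send_buffers.setdefault(port, [])
            while len(send_buf) < PORT_BUFFER_LIMIT and len(queue) > 0:
                send_buf.append(queue.popleft())
```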
Fig. 4 is a schematic diagram of a hardware structure of an access switch according to the present invention. The access switch mainly comprises a network interface module (a downlink network interface module 401 and an uplink network interface module 402), a switching engine module 403 and a CPU module 404;
Wherein, a packet (uplink data) coming from the downlink network interface module 401 enters the packet detection module 405; the packet detection module 405 detects whether the Destination Address (DA), Source Address (SA), packet type and packet length of the packet meet the requirements, and if so, allocates a corresponding stream identifier (stream-id) and passes the packet to the switching engine module 403; otherwise, the packet is discarded. A packet (downlink data) coming from the uplink network interface module 402 enters the switching engine module 403, as does a data packet coming from the CPU module 404. The switching engine module 403 performs an address table 406 lookup on each incoming packet to obtain its steering information. If the packet entering the switching engine module 403 is going from the downlink network interface to the uplink network interface, it is stored in the queue of the corresponding packet buffer 407 in association with its stream-id; if that queue is nearly full, the packet is discarded. If the packet entering the switching engine module 403 is not going from the downlink network interface to the uplink network interface, it is stored in the queue of the corresponding packet buffer 407 according to its steering information; if that queue is nearly full, the packet is discarded.
The switching engine module 403 polls all packet buffer queues, which in this embodiment of the invention is divided into two cases:
If the queue is from the downlink network interface to the uplink network interface, the following conditions are met for forwarding: 1) the port send buffer is not full; 2) the queued packet counter is greater than zero; 3) obtaining a token generated by a code rate control module;
If the queue is not from the downlink network interface to the uplink network interface, the following conditions are met for forwarding: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero.
The rate control module 408 is configured by the CPU module 404 to generate tokens for packet buffer queues going to the upstream network interface from all downstream network interfaces at programmable intervals to control the rate of upstream forwarding.
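The token condition above amounts to a token-bucket check layered on top of the ordinary forwarding conditions. The sketch below is one plausible way to model it; the token interval and burst size are assumptions for illustration, not values given in this description.

```python
# Hedged sketch of rate control for downlink-to-uplink queues: tokens are
# produced at a programmable interval, and an uplink-bound packet is forwarded
# only if a token is available in addition to the two ordinary conditions
# (send buffer not full, queue not empty).
import time

class RateController:
    def __init__(self, token_interval_s=0.001, max_tokens=32):   # illustrative values
        self.token_interval_s = token_interval_s
        self.max_tokens = max_tokens
        self.tokens = 0
        self._last = time.monotonic()

    def refill(self):
        now = time.monotonic()
        new_tokens = int((now - self._last) / self.token_interval_s)
        if new_tokens > 0:
            self.tokens = min(self.max_tokens, self.tokens + new_tokens)
            self._last += new_tokens * self.token_interval_s

    def try_consume(self):
        self.refill()
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

def forward_uplink(queue, send_buffer, buffer_limit, rate_ctrl):
    """Forward one uplink-bound packet only if all three conditions hold."""
    if len(send_buffer) < buffer_limit and len(queue) > 0 and rate_ctrl.try_consume():
        send_buffer.append(queue.popleft())
        return True
    return False
```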
The CPU module 404 is mainly responsible for protocol processing with the node server, configuration of the address table 406, and configuration of the code rate control module 408.
The devices of the access network portion further include an Ethernet protocol conversion gateway. Fig. 5 is a schematic diagram of a hardware structure of an Ethernet protocol conversion gateway according to the present invention. The Ethernet protocol conversion gateway mainly includes a network interface module (a downlink network interface module 501 and an uplink network interface module 502), a switching engine module 503, a CPU module 504, a packet detection module 505, a code rate control module 508, an address table 506, a packet buffer 507, a MAC adding module 509 and a MAC deleting module 510.
Wherein, a data packet coming from the downlink network interface module 501 enters the packet detection module 505; the packet detection module 505 detects whether the Ethernet MAC DA, Ethernet MAC SA, Ethernet length or frame type, video network destination address DA, video network source address SA, video network packet type and packet length of the packet meet the requirements, and if so, allocates a corresponding stream identifier (stream-id); the MAC deleting module 510 then strips off the MAC DA, the MAC SA and the length or frame type (2 bytes), and the packet enters the corresponding receiving buffer; otherwise, the packet is discarded;
The downlink network interface module 501 detects the sending buffer of the port; if there is a packet, it obtains the Ethernet MAC DA of the corresponding terminal according to the video network destination address DA of the packet, adds the Ethernet MAC DA of the terminal, the MAC SA of the Ethernet protocol conversion gateway and the Ethernet length or frame type, and sends the packet.
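The MAC deleting and MAC adding steps above amount to stripping and re-attaching a 14-byte Ethernet header around the video-network packet. The following Python sketch illustrates that idea under stated assumptions; the frame-type value and the MAC lookup table are placeholders, not details taken from this description.

```python
# Illustrative sketch of the MAC delete/add steps: the 14-byte Ethernet header
# (6-byte MAC DA, 6-byte MAC SA, 2-byte length or frame type) is removed from
# frames arriving on the downlink interface and re-attached before sending a
# packet back out to an Ethernet terminal.
ETH_HEADER_LEN = 6 + 6 + 2   # MAC DA + MAC SA + length/frame type

def delete_mac(frame: bytes) -> bytes:
    """MAC deleting module: drop the Ethernet header, keep the video-network packet."""
    return frame[ETH_HEADER_LEN:]

def add_mac(packet: bytes, terminal_mac: bytes, gateway_mac: bytes,
            eth_type: bytes = b"\x08\x00") -> bytes:
    """MAC adding module: prepend terminal MAC DA, gateway MAC SA and type.

    The eth_type value here is only a placeholder; the actual length/frame
    type used by the gateway is not specified in this description.
    """
    assert len(terminal_mac) == 6 and len(gateway_mac) == 6
    return terminal_mac + gateway_mac + eth_type + packet

# Hypothetical lookup table: video-network DA (bytes) -> terminal Ethernet MAC.
mac_table = {}
```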
The other modules of the Ethernet protocol conversion gateway function similarly to those of the access switch.
The terminal of the access network part mainly comprises a network interface module, a service processing module and a CPU module; for example, the set-top box mainly comprises a network interface module, a video and audio coding and decoding engine module and a CPU module; the coding board mainly comprises a network interface module, a video and audio coding engine module and a CPU module; the memory mainly comprises a network interface module, a CPU module and a disk array module.
Similarly, devices of the metropolitan network portion may also be classified into 3 types: a metropolitan area server, a node switch and a node server. The metro server is connected to a node switch, which may be connected to a plurality of node servers. The node switch mainly comprises a network interface module, a switching engine module and a CPU module; the metropolitan area server mainly comprises a network interface module, a switching engine module and a CPU module.
The node server is a node server of the access network part, namely the node server belongs to both the access network part and the metropolitan area network part.
The metropolitan area server is a node which plays a centralized control function in the metropolitan area network and can control a node switch and a node server. The metropolitan area server can be directly connected with the node switch or directly connected with the node server.
Therefore, the whole video network is a network structure with layered centralized control, and the network controlled by the node server and the metropolitan area server can be in various structures such as tree, star and ring.
The access network part can form a unified video platform (the part in the dotted circle), and a plurality of unified video platforms can form a video network; each unified video platform may be interconnected via metropolitan area and wide area video networking.
The video networking data packets include access network data packets and metropolitan area network data packets.
The data packet of the access network mainly comprises the following parts: destination Address (DA), Source Address (SA), reserved byte, Payload (PDU), CRC.
As shown in the following table, the data packet of the access network mainly includes the following parts:
DA | SA | Reserved | Payload | CRC
The Destination Address (DA) is composed of 8 bytes (byte), the first byte represents the type of the data packet (e.g. various protocol packets, multicast data packet, unicast data packet, etc.), there are at most 256 possibilities, the second byte to the sixth byte are metropolitan area network addresses, and the seventh byte and the eighth byte are access network addresses.
The Source Address (SA) is also composed of 8 bytes (byte), defined as the same as the Destination Address (DA).
The reserved byte consists of 2 bytes.
The payload has different lengths for different types of datagrams: 64 bytes for the various protocol packets, and 32 + 1024 = 1056 bytes for a unicast data packet, although the payload is not limited to these two types.
The CRC consists of 4 bytes and is calculated in accordance with the standard ethernet CRC algorithm.
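Putting the field definitions above together, a receiver could unpack an access-network packet roughly as follows. This is a hedged sketch: the CRC byte order and the use of zlib.crc32 as the "standard Ethernet CRC" are assumptions made for illustration.

```python
# Sketch of parsing the access-network packet layout: 8-byte DA, 8-byte SA,
# 2 reserved bytes, variable-length payload, 4-byte CRC-32.
import zlib

def parse_access_packet(raw: bytes) -> dict:
    da, sa, reserved = raw[0:8], raw[8:16], raw[16:18]
    payload, crc = raw[18:-4], raw[-4:]
    # byte order of the transmitted CRC is assumed here
    if zlib.crc32(raw[:-4]) != int.from_bytes(crc, "big"):
        raise ValueError("CRC check failed")
    return {
        "packet_type": da[0],        # 1st byte: protocol/multicast/unicast, up to 256 types
        "metro_address": da[1:6],    # 2nd to 6th byte
        "access_address": da[6:8],   # 7th and 8th byte
        "sa": sa,
        "reserved": reserved,
        "payload": payload,          # e.g. 64 bytes for protocol packets, 1056 for unicast
    }
```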
The topology of a metropolitan area network is a graph, and there may be two or even more connections between two devices, i.e., there may be more than two connections between a node switch and a node server, or between two node switches. However, the metro network address of each metro network device is unique; therefore, in order to accurately describe the connection relationships between metro network devices, a parameter is introduced into the video network: a label, which uniquely describes a metropolitan area network device.
In the video network, the definition of a label is similar to that of an MPLS (Multi-Protocol Label Switching) label. Assuming that there are two connections between device A and device B, a packet from device A to device B has two possible labels, and a packet from device B to device A likewise has two possible labels. Labels are divided into incoming labels and outgoing labels: assuming that the label (incoming label) of a packet entering device A is 0x0000, the label (outgoing label) of the packet when it leaves device A may become 0x0001. The network entry process of the metro network is a network entry process under centralized control, that is, both address assignment and label assignment of the metro network are dominated by the metro server, while the node switch and the node server passively accept them; this differs from MPLS label assignment, which is the result of mutual negotiation between switch and server.
As shown in the following table, the data packet of the metro network mainly includes the following parts:
DA | SA | Reserved | Label | Payload | CRC
Namely: Destination Address (DA), Source Address (SA), Reserved bytes (Reserved), Label, Payload (PDU) and CRC. The format of the label may be defined as follows: the label is 32 bits, with the upper 16 bits reserved and only the lower 16 bits used, and its position is between the reserved bytes and the payload of the packet.
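Given that layout, extracting the usable 16-bit label from the 32-bit field could look like the following sketch; big-endian byte order is assumed only for illustration.

```python
# Sketch of reading the metro-network label: a 32-bit field placed after the
# 8-byte DA, 8-byte SA and 2 reserved bytes, with only the lower 16 bits used.
def extract_label(raw: bytes) -> int:
    label_field = int.from_bytes(raw[18:22], "big")   # position: between reserved bytes and payload
    return label_field & 0xFFFF                       # upper 16 bits are reserved
```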
In an embodiment of the present invention, the video network may include a video network server, a first video conference terminal, and a second video conference terminal.
The first video conference terminal and the second video conference terminal may be set-top boxes (Set Top Box, STB), i.e., devices that connect a television set to an external signal source and convert compressed digital signals into television content for display on the television set.
Generally, the set-top box may be connected to a camera and a microphone for collecting multimedia data such as video data and audio data, and may also be connected to a television for playing multimedia data such as video data and audio data.
In application scenes such as video conferences and the like, a first video conference terminal and a second video conference terminal are external signal sources, namely the first video conference terminal can collect multimedia data and send the multimedia data to the second video conference terminal through a video networking server, and the second video conference terminal plays the received multimedia data; meanwhile, the second video conference terminal can also collect multimedia data and send the multimedia data to the first video conference terminal through the video networking server, and the first video conference terminal plays the received multimedia data.
In this embodiment, the description is given by taking an example in which the second video conference terminal acquires audio data and sends the audio data to the first video conference terminal through the video networking server. It should be noted that, in a video conference scene, the operations performed by the first video conference terminal and the second video conference terminal should be the same, that is, when the first video conference terminal receives the audio data sent by the second video conference terminal and performs the echo cancellation operation on the audio data, the second video conference terminal may also perform the echo cancellation operation on the received audio data collected by the first video conference terminal by using the method of this embodiment.
In an embodiment of the present invention, before performing the echo cancellation operation, the first video conference terminal may first determine a filter coefficient. The filter coefficients may refer to coefficients of an adaptive filter that actually performs an echo cancellation operation on the received audio data in the first video conference terminal.
In a specific implementation, the filter coefficient may be fixed to a specific value; once the filter coefficient is determined, the range of the working delay of the adaptive filter corresponding to that coefficient can be determined accordingly. The filter coefficient determines the convergence of the echo cancellation algorithm, and in practical applications the algorithm is required to converge quickly and remain stable, that is, a coefficient should be chosen under which the filter converges quickly and operates stably.
It should be noted that, a person skilled in the art may set the specific value of the filter coefficient according to actual needs, and the embodiment of the present invention does not limit this.
Step 102, calculating a fixed delay between the first video conference terminal and a second video conference terminal;
In the embodiment of the invention, the first video conference terminal, the second video conference terminal, the video networking server and other equipment can be jointly constructed into a video conference system. The fixed delay between the first video conference terminal and the second video conference terminal may refer to a fixed delay of a current video conference system.
In a specific implementation, the fixed delay may be obtained by storing the collected and the played audio data as audio files in real time, without buffering data on the first video conference terminal, and then analyzing the audio files with an audio analysis tool.
Step 103, the first video conference terminal acquires the initial data volume of the reference data buffer;
In this embodiment of the present invention, the initial data amount may refer to an amount of data buffered in the reference data buffer before the first video conference terminal performs the echo cancellation operation.
Step 104, the first video conference terminal adjusts the initial data volume to a target data volume according to the filter coefficient and the fixed delay;
Generally, after receiving the audio data collected by the second video conference terminal, the first video conference terminal needs to send the received audio data to the sound card and also to the echo cancellation algorithm as reference data. The echo cancellation operation performed by the echo cancellation algorithm is the adaptive filtering process. Echo cancellation is mainly based on echo estimation: the echo is estimated by an adaptive method and then subtracted from the received signal to cancel the echo. This requires that the reference data arrive ahead of the echo data.
Therefore, the system delay can be changed by adjusting the size of the initial data volume in the reference data buffer, so that the fixed delay in the system is close to the optimal working delay corresponding to the filter coefficient, and the requirements are met.
For example, if the fixed delay of the system is 300 ms and the optimal working delay corresponding to the filter coefficient is 200 ms, then in order to bring the system's fixed delay close to the filter's working delay, 100 ms of additional data can be buffered by increasing the data amount in the reference data buffer.
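The adaptive filtering described above, i.e., estimating the echo from the reference data and subtracting it from the captured signal, is commonly realized with a normalized LMS filter. The sketch below shows that general technique only; the filter length and step size are illustrative values and are not the coefficients chosen in this embodiment.

```python
# Minimal NLMS adaptive-filter sketch: estimate the echo from the reference
# (far-end) signal and subtract it from the microphone signal.
import numpy as np

def nlms_echo_cancel(mic, reference, num_taps=1024, mu=0.5, eps=1e-6):
    """Return the echo-cancelled signal; mic and reference are 1-D float arrays."""
    w = np.zeros(num_taps)                 # adaptive filter weights
    x_buf = np.zeros(num_taps)             # most recent reference samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = reference[n] if n < len(reference) else 0.0
        echo_estimate = np.dot(w, x_buf)   # estimated echo
        e = mic[n] - echo_estimate         # error = near-end speech + residual echo
        out[n] = e
        w += (mu / (eps + np.dot(x_buf, x_buf))) * e * x_buf   # NLMS weight update
    return out
```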
Step 105, the first video conference terminal receives audio data sent by the video networking server through a downlink communication link, and the audio data is collected by the second video conference terminal;
In a specific implementation, during a video conference, a remote video conference terminal, that is, a second video conference terminal, may collect audio data and send the audio data to the video networking server through an uplink communication link, and after receiving the audio data, the video networking server first determines a destination address of the audio data, and then sends the audio data to the first video conference terminal through a downlink communication link. After receiving the audio data, the first video conference terminal, i.e., the local video conference terminal, may perform an echo cancellation operation on the audio data.
Step 106, the first video conference terminal performs an echo cancellation operation on the audio data according to the target data amount.
Usually, after far-end audio data is played by the sound card of the local terminal, the played sound travels through an echo path and is collected again together with the local voice; this re-collected signal is precisely the echo that the echo cancellation algorithm of the adaptive filter needs to eliminate.
For example, after the first video conference terminal plays the audio data sent by the second video conference terminal, the sound emitted by the loudspeaker is transmitted into the microphone again through air propagation or wall reflection and is collected again together with the local voice. Therefore, in order to improve call quality in the video conference, this echo should be eliminated as much as possible.
In a specific implementation, local voice data collected when audio data transmitted from a far end is played may be transmitted to an adaptive filter via a reference data buffer, and the adaptive filter performs echo cancellation processing, thereby canceling echo.
In the embodiment of the present invention, the first video conference terminal determines the filter coefficient and the fixed delay between the first video conference terminal and the second video conference terminal, and adjusts the initial data amount in the reference data buffer to the target data amount accordingly, so that after receiving the audio data collected by the second video conference terminal and forwarded by the video networking server, the first video conference terminal can perform an echo cancellation operation on the audio data. By adjusting the data amount in the reference data buffer, the fixed delay of the system is brought close to the working delay corresponding to the filter coefficient, so that the timing between the reference data and the echo data reaches dynamic balance, the echo cancellation algorithm's requirement for timing synchronization of the audio data is satisfied, the echo is cancelled, and the call quality of the video conference is improved.
Referring to Fig. 6, a flowchart illustrating the steps of a second embodiment of the echo cancellation method according to the present invention is shown; the method may specifically include the following steps:
Step 601, the first video conference terminal determines a filter coefficient;
It should be noted that the method can be applied to a video network, and the video network can include a video network server and a video conference terminal. The video conference terminal may include at least two, i.e., a first video conference terminal and a second video conference terminal.
One video conference terminal can collect multimedia data such as video data and audio data; the multimedia data is transmitted to the other video conference terminal through the video networking server and played on the receiving video conference terminal, thereby realizing a real-time video conference between at least two parties.
In the embodiment of the present invention, a first video conference terminal is taken as a local video conference terminal, and a second video conference terminal is taken as a remote video conference terminal for example. That is, the first video conference terminal receives multimedia data such as video data and audio data transmitted by the second video conference terminal and plays the multimedia data locally, so as to realize a video conference between a local user and a far-end user. Certainly, in the video conference process, the local video conference terminal also collects multimedia data such as local video data and audio data in real time and transmits the multimedia data to the remote video conference terminal for playing, and the operation processes between the two parties are basically consistent.
Generally, when a first video conference terminal receives audio data transmitted by a second video conference terminal and plays the audio data through a loudspeaker, the played sound is transmitted to a microphone again through air propagation or wall reflection, and is collected again along with local voice. Therefore, in order to improve the call quality in the video conference, it is necessary to cancel this part of the echo.
In an embodiment of the present invention, the filter coefficients of the adaptive filter may be first determined before canceling the echo. In practice, after receiving the audio data of the far end, the first video conference terminal transmits the audio data to the adaptive filter, and the adaptive filter performs echo cancellation processing on the audio data through an echo cancellation algorithm.
Step 602, calculating a fixed time delay between the first video conference terminal and the second video conference terminal;
In the embodiment of the present invention, the fixed delay between the first video conference terminal and the second video conference terminal may refer to the fixed delay of the video conference system composed of the video conference terminals, the video networking server and other devices.
In the embodiment of the invention, the collected and the played audio data can be stored as audio files in real time, without buffering data on the first video conference terminal, and the fixed delay can be obtained by analyzing the stored audio files with an audio analysis tool.
In a specific implementation, the reference data buffer may first be emptied, then target audio data is collected and played, a first audio file and a second audio file are generated from the collected and the played target audio data respectively, and the fixed delay between the first audio file and the second audio file is calculated by the audio analysis tool, thereby obtaining the fixed delay of the system.
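One way to realize such an analysis is to cross-correlate the played (reference) recording with the captured recording and read the delay off the correlation peak. The sketch below assumes 32 kHz mono WAV files and the availability of the soundfile package; the file names are hypothetical.

```python
# Hedged sketch of estimating the fixed delay from the two recordings:
# the lag of the cross-correlation peak, divided by the sample rate,
# gives the delay of capture relative to playback.
import numpy as np
import soundfile as sf   # assumed available; any WAV reader would work

def estimate_fixed_delay_ms(played_path="played.wav", captured_path="captured.wav"):
    played, sr = sf.read(played_path, dtype="float32")
    captured, _ = sf.read(captured_path, dtype="float32")
    corr = np.correlate(captured, played, mode="full")
    lag = int(np.argmax(corr)) - (len(played) - 1)   # samples by which capture lags playback
    return 1000.0 * lag / sr
```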
Step 603, acquiring the initial data volume of the reference data buffer;
In this embodiment of the present invention, the initial data amount may refer to an amount of data buffered in the reference data buffer before the first video conference terminal performs the echo cancellation operation.
Step 604, determining a working delay corresponding to the filter coefficient;
In the embodiment of the present invention, the operation delay may refer to an optimal operation delay corresponding to a filter coefficient of the adaptive filter. Typically, the operating delay is obtained after the filter coefficients are determined.
Step 605, calculating a delay difference between the fixed delay and the working delay;
For example, assuming that the fixed delay of the system is 300ms and the optimal operation delay corresponding to the filter coefficient is 200ms, the delay difference between the two is 100 ms.
Of course, the fixed delay of the system may be less than the optimal operating delay corresponding to the filter coefficients. For example, the fixed delay of the system is 200ms, the optimal working delay corresponding to the filter coefficient is 250ms, and the delay difference between the two is-50 ms.
Step 606, adjusting the initial data volume to a target data volume according to the delay difference;
In the embodiment of the invention, the initial data volume in the reference data buffer area is adjusted to the target data volume, and the system delay can be changed, so that the fixed delay in the system is close to the optimal working delay corresponding to the filter coefficient.
Therefore, in a specific implementation, when the delay difference is greater than zero, data may be buffered in the reference data buffer, so that the buffered data amount is equal to the data amount corresponding to the delay difference; when the delay difference is less than zero, part of the data may be discarded, so that the remaining data amount in the reference data buffer is equal to the data amount corresponding to the delay difference.
It should be noted that the target data amount in the adjusted buffer may not be completely equal to the data amount corresponding to the delay difference, but only needs to be within a certain range.
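A minimal sketch of steps 604 to 606 follows, assuming 16-bit, 32 kHz mono reference data (64 bytes per millisecond, as in the example given later in this description) and an illustrative tolerance reflecting the note above.

```python
# Sketch of adjusting the reference buffer by the delay difference:
# pad with silence when the difference is positive, trim when it is negative.
BYTES_PER_MS = 64   # 32 kHz * 16 bit * 1 channel

def adjust_reference_buffer(ref_buffer: bytearray, fixed_delay_ms: int,
                            working_delay_ms: int, tolerance_ms: int = 10):
    delay_diff_ms = fixed_delay_ms - working_delay_ms
    target_bytes = abs(delay_diff_ms) * BYTES_PER_MS
    if delay_diff_ms > tolerance_ms:
        # buffer extra reference data equal to the delay difference
        # (silence is one simple way to supply it)
        ref_buffer.extend(b"\x00" * target_bytes)
    elif delay_diff_ms < -tolerance_ms:
        # discard data corresponding to the delay difference
        del ref_buffer[:min(target_bytes, len(ref_buffer))]
    return ref_buffer
```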
Step 607, receiving the audio data sent by the video networking server through the downlink communication link, where the audio data is collected by the second video conference terminal;
In a specific implementation, during a video conference, a remote video conference terminal, that is, a second video conference terminal, may collect audio data and send the audio data to the video networking server through an uplink communication link, and after receiving the audio data, the video networking server first determines a destination address of the audio data, and then sends the audio data to the first video conference terminal through a downlink communication link. After the first video conference terminal, i.e. the local video conference terminal, receives the audio data, step 608 and step 609 may be performed in sequence to perform an echo cancellation operation on the audio data.
Step 608, collecting local voice data when playing the audio data;
In the embodiment of the present invention, when the audio data sent by the second video conference terminal is played out through the loudspeaker, the played sound reaches the microphone again through air propagation or wall reflection and is collected together with the local voice; this part of the collected data is the echo that the adaptive filter needs to remove during echo cancellation processing.
Step 609, transmitting the local voice data to an adaptive filter through the reference data buffer, and performing echo cancellation operation on the local voice data by the adaptive filter.
In embodiments of the present invention, local voice data may be transmitted to the adaptive filter via the reference data buffer. Since the amount of data in the reference data buffer has been adjusted, the timing between the reference data and the echo data can be synchronized. Therefore, the adaptive filter can effectively perform echo cancellation processing, thereby canceling echo.
For ease of understanding, the echo cancellation method of the present invention is described below with a specific example.
Take a certain video conference system as an example. First, the parameters in the audio system are as follows:
At the acquisition end, the audio sampling rate is 32 kHz, the sampling precision is 16 bits, and the audio is single-channel (mono);
At the playing end, the audio sampling rate is 32 kHz, the sampling precision is 16 bits, and the sound source is two-channel (stereo).
When echo cancellation is carried out, the parameters of the reference data and of the collected data must be consistent; the reference data is obtained after the playback data is converted to a single channel, i.e., an audio sampling rate of 32 kHz, a sampling precision of 16 bits, single-channel.
The code rate of audio acquisition is 512kbps, and the data volume of 1ms is 64B;
The code rate of audio playing is 1024kbps, and the data volume of 1ms is 128B.
In the following description, the amount of data is typically converted to an amount of time for ease of comparison with the delay requirements in the echo cancellation algorithm.
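The conversion between data amount and time for these parameters can be checked with a few lines of arithmetic (illustrative only):

```python
# Quick check of the data-rate figures quoted above.
def bytes_per_ms(sample_rate_hz, bits_per_sample, channels):
    return sample_rate_hz * (bits_per_sample // 8) * channels // 1000

capture = bytes_per_ms(32000, 16, 1)    # 64 B per ms  -> 512 kbps acquisition
playback = bytes_per_ms(32000, 16, 2)   # 128 B per ms -> 1024 kbps playback
print(capture, playback)                # 64 128
```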
Secondly, defining the cache of audio acquisition and playing in the system:
Acquisition buffer: collected data is consumed as it is collected, and the buffer generally does not exceed the minimum data amount required by the next-level task;
Play buffer, usually referring to the sound card buffer: for the timing synchronization of echo cancellation, the data amount of the sound card buffer can be controlled between 3 kB and 6 kB, i.e., between 24 ms and 48 ms, so as to minimize delay jitter.
Firstly, the fixed delay in the current working environment of the system is correctly estimated, and the filter coefficient of the echo cancellation algorithm is selected as a certain value, which determines the delay range within which the algorithm works effectively;
For example, after estimation over a plurality of samples in different working environments, the fixed delay of the audio system of the video conference terminal is 200 ms to 300 ms, and the filter coefficient in the echo cancellation algorithm can be selected such that the effective working delay is 150 ms to 250 ms.
Secondly, by adjusting the data amount in the reference data buffer, the delay in the audio system is brought close to the optimal working delay corresponding to the coefficient of the filter in the echo cancellation algorithm;
The played two-channel data is transmitted to the sound card and, at the same time, a copy is converted to single-channel data and written into the reference data buffer. Assume that the fixed delay of the current system is 250 ms and the initial data amount in the reference data buffer is set to 100 ms. At this moment there is no data in the acquisition buffer, so the difference between the data amounts of the reference data buffer and the acquisition data buffer corresponds to 100 ms of data; the system delay therefore becomes 150 ms, and the requirement on the algorithm's effective working delay is met.
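A sketch of this reference-data path follows, assuming 16-bit interleaved stereo playback frames and the 100 ms initial data amount from the example.

```python
# Sketch: played stereo frames are copied, down-mixed to mono and appended to
# the reference buffer, which is pre-filled with 100 ms of (silent) initial
# data so that, with a 250 ms fixed delay, the effective system delay is
# about 150 ms.
import numpy as np

BYTES_PER_MS_MONO = 64
reference_buffer = bytearray(b"\x00" * (100 * BYTES_PER_MS_MONO))  # 100 ms initial data

def on_playback_frame(stereo_pcm: bytes):
    """Called with each interleaved 16-bit stereo frame sent to the sound card."""
    samples = np.frombuffer(stereo_pcm, dtype=np.int16).reshape(-1, 2)
    mono = samples.astype(np.int32).mean(axis=1).astype(np.int16)   # stereo -> mono down-mix
    reference_buffer.extend(mono.tobytes())
```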
It should be noted that the size of the initial data amount in the reference data buffer is correlated with the system delay. The reference data should be fed into the echo cancellation algorithm before the echo data; the larger the initial data amount in the reference data buffer, the later the reference data is fed into the algorithm, and therefore the smaller the timing difference (i.e., the delay) between the reference data and the echo data. Because factors such as audio scheduling, network packet loss and multi-channel audio mixing can interrupt the playback data so that the reference data cannot be replenished, the data in the reference data buffer is consumed and the delay always tends to grow gradually. Thus, the principle for setting the data amount of the reference data buffer is: set as large a buffer as possible, so that the echo cancellation algorithm works in a smaller delay interval and algorithm failure caused by growing data-jitter delay is avoided. Since the data amount in the reference data buffer is set dynamically, it can be abstracted as an application parameter for setting the initial data amount of the buffer.
And thirdly, establishing a data synchronization mechanism to enable the reference data and the collected data to achieve dynamic balance under instant communication.
The final purpose of synchronization is to make the delay between the reference data and the collected data as close as possible to the algorithm delay and keep it stable, so that the data can be processed uniformly by the echo cancellation algorithm to cancel the echo. This requires strict control of the various links and external factors in the audio system, including the fixed delay of system operation, the sound card play buffer, the reference data buffer, the collected data buffer, network packet loss and so on.
Once the system fixed delay, the algorithm delay and the initial data amount in the reference data buffer have been confirmed so that the system delay matches the algorithm delay, the initial condition of the synchronization mechanism is established.
In practical applications, on one hand, due to network jitter, audio data scheduling, multi-channel audio mixing and other factors, the playback data can be interrupted, and the sudden reduction of reference data seriously damages the initially established synchronization condition. In this case, the playback data needs to be supplemented in time; in general, the reference data buffer can be filled with mute (silent) data. On the other hand, due to network delay or system congestion, data may first be supplemented automatically and then a large amount of data may arrive within a short time, causing redundancy, which also damages the initially established synchronization condition. In this case, the redundant data needs to be discarded. Therefore, in addition to the initial condition of the synchronization mechanism, a data balancing mechanism must be established, defining a lower limit below which the reference data buffer must be supplemented and an upper limit above which data must be discarded. The upper and lower limits refer to the difference between the data amount in the reference data buffer and the data amount in the collected data buffer, and they are related to the performance of the adaptive filter in the echo cancellation algorithm, i.e., to the algorithm delay corresponding to the filter coefficient and the echo estimation range.
For example, if the algorithm delay corresponding to the filter coefficient is 200 ms and the echo estimation range is ±50 ms, the corresponding effective working delay range of the filter is 150 ms to 250 ms, and the difference between the upper limit and the lower limit of the data amount of the reference data buffer may be 100 ms.
Therefore, data can be supplemented or discarded according to the lower and upper critical values of the data amount, so that the reference data and the collected data reach dynamic balance during real-time communication and timing synchronization is realized.
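A hedged sketch of such a balancing mechanism follows. The controlled quantity is the difference between the reference-buffer and capture-buffer data amounts; the 100 ms window width follows the example above, while the absolute limit values are illustrative assumptions.

```python
# Sketch of the data balancing mechanism: keep the difference between the
# reference-buffer and capture-buffer data amounts inside a window; below the
# lower limit, supplement mute data; above the upper limit, discard redundant
# reference data.
BYTES_PER_MS = 64
LOWER_LIMIT_MS = 50                       # assumed lower critical value
UPPER_LIMIT_MS = LOWER_LIMIT_MS + 100     # window width of 100 ms, per the example

def balance(reference_buffer: bytearray, capture_buffer_len: int):
    diff_ms = (len(reference_buffer) - capture_buffer_len) // BYTES_PER_MS
    if diff_ms < LOWER_LIMIT_MS:
        # playback cut-off: supplement mute (silent) reference data
        reference_buffer.extend(b"\x00" * ((LOWER_LIMIT_MS - diff_ms) * BYTES_PER_MS))
    elif diff_ms > UPPER_LIMIT_MS:
        # data burst after congestion: discard the oldest redundant reference data
        del reference_buffer[: (diff_ms - UPPER_LIMIT_MS) * BYTES_PER_MS]
```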
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 7, a block diagram of an embodiment of an echo cancellation device according to the present invention is shown, where the device may be applied to a video network, where the video network may include a video network server, a first video conference terminal, and a second video conference terminal, and the device may specifically include the following modules:
A determining module 701, configured to determine a filter coefficient of the first video conference terminal;
A calculating module 702, configured to calculate a fixed delay between the first video conference terminal and the second video conference terminal;
An obtaining module 703, configured to obtain an initial data amount of a reference data buffer of the first video conference terminal;
An adjusting module 704, configured to adjust the initial data size to a target data size according to the filter coefficient and the fixed delay;
A receiving module 705, configured to receive audio data sent by the video networking server through a downlink communication link, where the audio data may be collected by the second video conference terminal;
An executing module 706, configured to execute an echo cancellation operation on the audio data according to the target data amount.
In this embodiment of the present invention, the calculating module 702 may specifically include the following sub-modules:
The data volume emptying submodule is used for emptying the data volume of the reference data buffer;
The target audio data acquisition and playing submodule is used for acquiring and playing target audio data;
The audio file generation submodule is used for respectively generating a first audio file and a second audio file according to the collected and played target audio data;
And the fixed time delay calculation sub-module is used for calculating the fixed time delay between the first audio file and the second audio file.
In this embodiment of the present invention, the adjusting module 704 may specifically include the following sub-modules:
The working delay determining submodule is used for determining the working delay corresponding to the filter coefficient;
The delay difference value calculating submodule is used for calculating a delay difference value between the fixed delay and the working delay;
And the target data volume adjusting submodule is used for adjusting the initial data volume to the target data volume according to the delay difference.
In this embodiment of the present invention, the target data amount adjusting submodule may specifically include the following units:
The buffer unit is used for buffering data in the reference data buffer area when the delay difference is larger than zero, so that the buffered data volume is equal to the data volume corresponding to the delay difference;
And the discarding unit is used for discarding partial data when the delay difference is smaller than zero so as to enable the residual data volume in the reference data buffer to be equal to the data volume corresponding to the delay difference.
In this embodiment of the present invention, the execution module 706 specifically includes the following sub-modules:
The local voice data acquisition submodule is used for acquiring local voice data when the audio data is played;
And the local voice data transmission sub-module is used for transmitting the local voice data to an adaptive filter through the reference data buffer, and the adaptive filter performs echo cancellation operation on the local voice data.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The echo cancellation method and the echo cancellation device provided by the present invention have been described in detail above. Specific examples have been used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.