CN109147812B

CN109147812B - Echo cancellation method and device

Info

Publication number: CN109147812B
Application number: CN201811095842.4A
Authority: CN
Inventors: 靳伟明; 牛永会; 王艳辉; 刘苹苹
Original assignee: Visionvera Information Technology Co Ltd
Current assignee: Hainan Qiantang Shilian Information Technology Co.,Ltd.
Priority date: 2018-09-19
Filing date: 2018-09-19
Publication date: 2020-09-08
Anticipated expiration: 2038-09-19
Also published as: CN109147812A

Abstract

The invention provides an echo cancellation method and device applied to a video network. The method comprises the following steps: a first video network terminal receives first audio data sent by a video network server according to a first downlink communication link configured for the first video network terminal; the first video network terminal stores the first audio data in a preset play queue and plays the first audio data; the first video network terminal collects second audio data, extracts the first audio data from the play queue, and performs echo cancellation on the second audio data by using the first audio data to obtain third audio data; and the first video network terminal sends the third audio data to the video network server based on the video networking protocol, and the video network server sends the third audio data to a second video network terminal according to a second downlink communication link configured for the second video network terminal. The invention can cancel the echo in two-way communication in software, so the method is simple and low in cost; because data transmission is based on the video networking protocol, transmission is faster and the real-time performance of the communication is improved.

Description

Echo cancellation method and device

Technical Field

The present invention relates to the field of video networking technologies, and in particular, to an echo cancellation method and an echo cancellation device.

Background

In conventional telephone systems, there is a "line echo" phenomenon. The main cause of this echo is that the hybrid "leaks" because of an impedance mismatch at the 2-wire/4-wire conversion, resulting in a "circuit echo". Where, then, does the echo of an IP phone come from? On the one hand, when an IP telephone system is interconnected with a PSTN (Public Switched Telephone Network), the 2-wire/4-wire conversion circuit involving the hybrid coil generates an echo. On the other hand, the voice data of IP phones is also subject to "acoustic echo" during transmission.

Acoustic echo refers to sound played by a loudspeaker that is picked up by a microphone and sent back to the far end, so that the far-end talker hears his or her own voice. Acoustic echoes are further classified into direct echoes and indirect echoes. Direct echo means that sound played by the loudspeaker enters the microphone directly without any reflection. Its echo delay is the shortest, and it is related to factors such as the voice energy of the far-end talker, the distance and angle between the loudspeaker and the microphone, the playback volume of the loudspeaker, and the pick-up sensitivity of the microphone. Indirect echo refers to the set of echoes generated when sound played by the loudspeaker enters the microphone after being reflected one or more times along different paths. When the echo return time exceeds 10 ms, the human ear can hear an obvious echo, and normal conversation is disturbed. In an IP network environment with relatively large delay, the delay can easily reach 50 ms. Therefore, the echo can seriously affect communication and degrade the user experience.

Disclosure of Invention

In view of the above problems, embodiments of the present invention are proposed to provide an echo cancellation method and a corresponding echo cancellation device that overcome or at least partially solve the above problems.

In order to solve the above problem, an embodiment of the present invention discloses an echo cancellation method, where the method is applied in a video network, and the method includes:

a first video network terminal receives first audio data sent by a video network server according to a first downlink communication link configured for the first video network terminal; the first audio data is collected by a second video network terminal and is sent to the video network server;

the first video network terminal stores the first audio data into a preset play queue and plays the first audio data;

the first video networking terminal collects second audio data, extracts the first audio data from the play queue, and performs echo cancellation on the second audio data by using the first audio data to obtain third audio data;

and the first video networking terminal sends the third audio data to the video networking server based on a video networking protocol, and the video networking server sends the third audio data to the second video networking terminal according to a second downlink communication link configured for the second video networking terminal.

Preferably, the step in which the first video network terminal collects second audio data, extracts the first audio data from the play queue, and performs echo cancellation on the second audio data by using the first audio data to obtain third audio data includes: the first video network terminal collects the second audio data and locates the position in the second audio data that is delayed by a set number of frames; and the first video network terminal extracts the first audio data from the play queue and, starting from that position, performs echo cancellation on the second audio data by using the first audio data to obtain the third audio data.
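The delayed-position step above can be pictured with a small sketch. The source does not name the echo cancellation algorithm, so the sketch swaps in a normalized least-mean-squares (NLMS) adaptive filter as one common choice. The names `nlms_echo_cancel`, `taps`, `mu` and `delay` are illustrative assumptions, and the delay is counted in samples here for brevity, whereas the text counts frames.

```python
import random

def nlms_echo_cancel(mic, ref, delay=0, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the echo of `ref` from `mic`.

    mic   -- captured samples (the "second audio data")
    ref   -- played samples taken from the play queue (the "first audio data")
    delay -- assumed offset, in samples, at which cancellation starts in `mic`
    Returns the residual samples (the "third audio data").
    """
    weights = [0.0] * taps        # adaptive estimate of the echo path
    history = [0.0] * taps        # most recent reference samples, newest first
    out = list(mic[:delay])       # samples before the delayed position pass through
    for n in range(delay, len(mic)):
        k = n - delay
        newest = ref[k] if k < len(ref) else 0.0
        history = [newest] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic[n] - echo_estimate
        power = sum(x * x for x in history) + eps
        # NLMS update: step size normalized by the reference signal power
        weights = [w + mu * error * x / power for w, x in zip(weights, history)]
        out.append(error)
    return out

# Synthetic check: the "microphone" hears only a scaled copy of the
# reference that arrives 2 samples late (hypothetical echo path).
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.0, 0.0] + [0.5 * s for s in ref[:-2]]
residual = nlms_echo_cancel(mic, ref, delay=2)
tail_mic = sum(s * s for s in mic[-200:])        # echo energy near the end
tail_res = sum(s * s for s in residual[-200:])   # residual energy near the end
```

Once the filter has adapted, `tail_res` is far smaller than `tail_mic`, which is the effect the echo cancellation step aims for. A real terminal would also have to handle near-end speech, which this sketch ignores.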

Preferably, the method further comprises: if the first video network terminal determines that the number of frames stored in the play queue is not within a set frame number range, adjusting the number of frames stored in the play queue to be within the set frame number range.

Preferably, the step of adjusting the number of frames stored in the play queue to be within the set frame number range includes: if the number of frames stored in the play queue is smaller than the minimum of the set frame number range, adding mute data to the play queue so that the number of frames stored in the play queue falls within the set frame number range; if the number of frames stored in the play queue is larger than the maximum of the set frame number range, deleting part of the first audio data from the play queue so that the number of frames stored in the play queue falls within the set frame number range.

Preferably, the set frame number ranges from 5 to 15 frames.
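The queue adjustment described above can be sketched as follows. This is a minimal illustration under stated assumptions: the frame size `FRAME_BYTES` is hypothetical, mute data is appended at the tail, and surplus frames are dropped from the head (oldest first). The source says only that mute data is added and that part of the first audio data is deleted.

```python
from collections import deque

MIN_FRAMES, MAX_FRAMES = 5, 15   # the preferred range given in the text
FRAME_BYTES = 320                # assumed frame size (not specified in the source)
SILENCE = bytes(FRAME_BYTES)     # one frame of zero-valued ("mute") data

def adjust_play_queue(queue, lo=MIN_FRAMES, hi=MAX_FRAMES):
    """Bring the number of queued frames into [lo, hi], modifying the queue in place."""
    while len(queue) < lo:
        queue.append(SILENCE)    # too few frames: pad with mute data
    while len(queue) > hi:
        queue.popleft()          # too many frames: drop the oldest (an assumption)
    return queue

# A queue holding 2 frames is padded up to 5; one holding 20 is trimmed to 15.
short = adjust_play_queue(deque([b"\x01" * FRAME_BYTES] * 2))
long_ = adjust_play_queue(deque(bytes([i]) * FRAME_BYTES for i in range(20)))
```

Keeping the queue depth bounded in this way keeps the delay between playback and capture within a known window, which is what lets the canceller start from a fixed delayed position.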

On the other hand, the embodiment of the invention also discloses an echo cancellation device, which is applied to a first video network terminal of the video network, and comprises the following components:

the receiving module is used for receiving first audio data sent by a video networking server according to a first downlink communication link configured for the first video networking terminal; the first audio data is collected by a second video network terminal and is sent to the video network server;

the playing module is used for storing the first audio data into a preset playing queue and playing the first audio data;

the eliminating module is used for acquiring second audio data, extracting the first audio data from the play queue, and performing echo elimination on the second audio data by using the first audio data to obtain third audio data;

and the sending module is used for sending the third audio data to the video networking server based on a video networking protocol, and sending the third audio data to the second video networking terminal by the video networking server according to a second downlink communication link configured for the second video networking terminal.

Preferably, the eliminating module comprises: a data positioning unit, configured to collect second audio data and locate the position in the second audio data that is delayed by a set number of frames; and an echo cancellation unit, configured to extract the first audio data from the play queue and, starting from that position, perform echo cancellation on the second audio data by using the first audio data to obtain third audio data.

Preferably, the device further comprises: an adjusting module, configured to adjust the number of frames stored in the play queue to be within a set frame number range if it is determined that the number of frames stored in the play queue is not within the set frame number range.

Preferably, the adjusting module comprises: a first adjusting unit, configured to add mute data to the play queue if the number of frames stored in the play queue is smaller than the minimum of the set frame number range, so that the number of frames stored in the play queue falls within the set frame number range; and a second adjusting unit, configured to delete part of the first audio data from the play queue if the number of frames stored in the play queue is greater than the maximum of the set frame number range, so that the number of frames stored in the play queue falls within the set frame number range.

Preferably, the set frame number ranges from 5 to 15 frames.

In the embodiment of the invention, a first video network terminal receives first audio data sent by a video network server according to a first downlink communication link configured for the first video network terminal, and the first audio data is collected by a second video network terminal and sent to the video network server; the first video network terminal stores the first audio data into a preset play queue and plays the first audio data; the first video networking terminal collects second audio data, extracts first audio data from the play queue, and performs echo cancellation on the second audio data by using the first audio data to obtain third audio data; and the first video networking terminal sends the third audio data to the video networking server based on the video networking protocol, and the video networking server sends the third audio data to the second video networking terminal according to a second downlink communication link configured for the second video networking terminal. Therefore, the echo in the two-way communication can be eliminated in a software mode, the method is simple and convenient, and the cost is low; and the data transmission process is based on a video networking protocol, so that the transmission is faster, and the real-time performance of communication is improved.

Drawings

FIG. 1 is a schematic networking diagram of a video network of the present invention;

FIG. 2 is a schematic diagram of a hardware architecture of a node server according to the present invention;

FIG. 3 is a schematic diagram of a hardware structure of an access switch of the present invention;

FIG. 4 is a schematic diagram of a hardware structure of an Ethernet protocol conversion gateway according to the present invention;

FIG. 5 is a flowchart illustrating steps of an echo cancellation method according to a first embodiment of the present invention;

FIG. 6 is a block diagram of an echo cancellation device according to a second embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

The video network is an important milestone in network development. It is a real-time network that can transmit high-definition video in real time, moving many Internet applications toward high-definition video and high-definition face-to-face communication.

The video network adopts real-time high-definition video switching technology and can integrate dozens of required services, such as video, voice, pictures, text, communication and data, on one network platform. Examples include high-definition video conferencing, video monitoring, intelligent monitoring and analysis, emergency command, digital broadcast television, time-shifted television, network teaching, live broadcast, video on demand (VOD), television mail, personal video recording (PVR), intranet (self-run) channels, intelligent video broadcast control and information distribution. High-definition video playback is delivered through a television or a computer.

To better understand the embodiments of the present invention, the video network is introduced below.

Some of the technologies applied in the video network are as follows:

Network Technology

The network technology innovation of the video network improves on traditional Ethernet to cope with the potentially enormous video traffic on the network. Unlike pure network packet switching or network circuit switching, the video networking technology adopts packet switching to meet streaming requirements. It has the flexibility, simplicity and low cost of packet switching together with the quality and security guarantees of circuit switching, thereby realizing seamless connection of whole-network switched virtual circuits and data formats.

Switching Technology

The video network retains the two advantages of Ethernet, asynchrony and packet switching, and eliminates the defects of Ethernet on the premise of full compatibility. It provides end-to-end seamless connection across the whole network, communicates directly with user terminals, and directly carries IP data packets. User data does not require any format conversion anywhere in the network. The video network is a higher-level form of Ethernet and a real-time switching platform; it can realize whole-network, large-scale, real-time transmission of high-definition video that the existing Internet cannot, and moves many network video applications toward high definition and unification.

Server Technology

The server technology of the video network and the unified video platform differs from that of a traditional server: its streaming media transmission is connection-oriented, its data processing capacity is independent of traffic and communication time, and a single network layer can carry both signaling and data transmission. For voice and video services, streaming media processing on the video network and unified video platform is much simpler than general data processing, and the efficiency is improved by more than one hundred times compared with a traditional server.

Storage Technology

To handle media content of very large capacity and very high traffic, the ultra-high-speed storage technology of the unified video platform adopts the most advanced real-time operating system. The program information in a server instruction is mapped to a specific hard disk space, so the media content no longer passes through the server but is sent directly and immediately to the user terminal; the typical user waiting time is less than 0.2 second. The optimized sector distribution greatly reduces the mechanical seek motion of the hard disk heads. Resource consumption is only 20% of that of an IP Internet system of the same grade, yet the concurrent traffic generated is 3 times larger than that of a traditional hard disk array, and the overall efficiency is improved by more than 10 times.

Network Security Technology

Through mechanisms such as independent permission control for each service and complete isolation of devices and user data, the structural design of the video network eliminates, at the structural level, the network security problems that trouble the Internet. It generally needs no antivirus programs or firewalls, is protected from attacks by hackers and viruses, and provides users with a structurally worry-free secure network.

Service Innovation Technology

The unified video platform integrates services and transmission: whether for a single user, a private network user or a network aggregate, only a single automatic connection is needed. A user terminal, set-top box or PC connects directly to the unified video platform to obtain a rich variety of multimedia video services. The unified video platform uses a menu-style configuration table in place of traditional complex application programming, so that complex applications can be implemented with very little code, enabling unlimited new service innovation.

Networking of the video network is as follows:

The video network has a centrally controlled network structure. The network may be a tree network, a star network, a ring network or the like, but on this basis a centralized control node is needed in the network to control the whole network.

As shown in fig. 1, the video network is divided into an access network and a metropolitan network.

The devices of the access network part can be mainly classified into 3 types: node server, access switch, terminal (including various set-top boxes, coding boards, memories, etc.). The node server is connected to an access switch, which may be connected to a plurality of terminals and may be connected to an ethernet network.

The node server is a node which plays a centralized control function in the access network and can control the access switch and the terminal. The node server can be directly connected with the access switch or directly connected with the terminal.

Similarly, devices of the metropolitan network portion may also be classified into 3 types: a metropolitan area server, a node switch and a node server. The metro server is connected to a node switch, which may be connected to a plurality of node servers.

The node server is a node server of the access network part, namely the node server belongs to both the access network part and the metropolitan area network part.

The metropolitan area server is a node which plays a centralized control function in the metropolitan area network and can control a node switch and a node server. The metropolitan area server can be directly connected with the node switch or directly connected with the node server.

Therefore, the whole video network is a network structure with layered centralized control, and the network controlled by the node server and the metropolitan area server can be in various structures such as tree, star and ring.

The access network part can form a unified video platform (the part in the dotted circle), and a plurality of unified video platforms can form a video network; each unified video platform may be interconnected via metropolitan area and wide area video networking.

1. Video networking device classification

1.1 Devices in the video network of the embodiment of the present invention can be mainly classified into 3 types: servers, switches (including Ethernet gateways), and terminals (including various set-top boxes, coding boards, memories, etc.). The video network as a whole can be divided into a metropolitan area network (or national network, global network, etc.) and an access network.

1.2 The devices of the access network part can be mainly classified into 3 types: node servers, access switches (including Ethernet gateways), and terminals (including various set-top boxes, coding boards, memories, etc.).

The specific hardware structure of each access network device is as follows:

Node server:

As shown in fig. 2, the node server mainly includes a network interface module 201, a switching engine module 202, a CPU module 203, and a disk array module 204.

Packets coming from the network interface module 201, the CPU module 203, and the disk array module 204 all enter the switching engine module 202. The switching engine module 202 looks up the address table 205 for each incoming packet to obtain the packet's direction information, and stores the packet in the queue of the corresponding packet buffer 206 based on that direction information; if the queue of the packet buffer 206 is nearly full, the packet is discarded. The switching engine module 202 polls all packet buffer queues and forwards from a queue if the following conditions are met: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero. The disk array module 204 mainly controls the hard disks, including initialization, reading and writing. The CPU module 203 is mainly responsible for protocol processing with the access switch and the terminal (not shown in the figure), for configuring the address table 205 (including a downlink protocol packet address table, an uplink protocol packet address table, and a data packet address table), and for configuring the disk array module 204.
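The lookup, buffering and polling behaviour of the switching engine can be sketched as below. This is only an illustration: the class name `SwitchingEngine`, the `queue_limit` threshold standing in for "nearly full", and the choice to discard packets with no address table entry are all assumptions not stated in the source.

```python
from collections import deque

class SwitchingEngine:
    """Toy model of address table lookup, per-port buffering and polling."""

    def __init__(self, address_table, queue_limit=4):
        self.address_table = address_table   # destination address -> output port
        self.queue_limit = queue_limit       # assumed "nearly full" threshold
        self.buffers = {port: deque() for port in set(address_table.values())}
        self.dropped = 0

    def ingest(self, dest, packet):
        """Look up the direction of a packet and queue it, or discard it."""
        port = self.address_table.get(dest)
        if port is None or len(self.buffers[port]) >= self.queue_limit:
            self.dropped += 1                # unknown destination or queue nearly full
            return False
        self.buffers[port].append(packet)
        return True

    def poll(self, send_buffer_full):
        """Forward one packet from every queue that meets both conditions."""
        sent = []
        for port in sorted(self.buffers):
            queue = self.buffers[port]
            # 1) port send buffer not full; 2) queue packet counter greater than zero
            if not send_buffer_full.get(port, False) and len(queue) > 0:
                sent.append((port, queue.popleft()))
        return sent

engine = SwitchingEngine({"A": 1, "B": 2}, queue_limit=2)
results = [engine.ingest("A", p) for p in ("p1", "p2", "p3")]  # third one is dropped
engine.ingest("B", "q1")
engine.ingest("Z", "x")                                        # no table entry
sent = engine.poll({1: False, 2: True})                        # port 2 cannot send
```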

Access switch:

As shown in fig. 3, the access switch mainly includes network interface modules (a downlink network interface module 301 and an uplink network interface module 302), a switching engine module 303 and a CPU module 304.

A packet (uplink data) coming from the downlink network interface module 301 enters the packet detection module 305. The packet detection module 305 checks whether the destination address (DA), source address (SA), packet type, and packet length of the packet meet the requirements; if so, it allocates a corresponding stream identifier (stream-id) and the packet enters the switching engine module 303; otherwise, the packet is discarded. A packet (downlink data) coming from the uplink network interface module 302 enters the switching engine module 303, as does a packet coming from the CPU module 304. The switching engine module 303 looks up the address table 306 for each incoming packet to obtain the packet's direction information. If a packet entering the switching engine module 303 is going from a downlink network interface to an uplink network interface, it is stored in the queue of the corresponding packet buffer 307 according to its stream-id; if that queue is nearly full, the packet is discarded. If a packet entering the switching engine module 303 is not going from a downlink network interface to an uplink network interface, it is stored in the queue of the corresponding packet buffer 307 according to its direction information; if that queue is nearly full, the packet is discarded.

The switching engine module 303 polls all packet buffer queues, which in this embodiment of the present invention is divided into two cases:

if the queue is from the downlink network interface to the uplink network interface, the following conditions are met for forwarding: 1) the port send buffer is not full; 2) the queued packet counter is greater than zero; 3) obtaining a token generated by a code rate control module;

if the queue is not from the downlink network interface to the uplink network interface, the following conditions are met for forwarding: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero.

The code rate control module 308 is configured by the CPU module 304 and generates tokens, at programmable intervals, for the packet buffer queues of all flows from downlink network interfaces to uplink network interfaces, in order to control the rate of uplink forwarding.

The CPU module 304 is mainly responsible for protocol processing with the node server, configuration of the address table 306, and configuration of the code rate control module 308.
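The two polling cases and the token mechanism can be sketched as follows. This is an illustration only: the class `RateControl`, the function `may_forward`, and the use of abstract "ticks" as the programmable interval are assumptions, and the source does not say how many tokens are issued per interval.

```python
class RateControl:
    """Issues one token every `interval` ticks (hypothetical unit of time)."""

    def __init__(self, interval):
        self.interval = interval
        self.tokens = 0
        self._elapsed = 0

    def tick(self):
        self._elapsed += 1
        if self._elapsed >= self.interval:
            self._elapsed = 0
            self.tokens += 1

    def take(self):
        """Consume a token if one is available."""
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

def may_forward(is_uplink_queue, send_buffer_full, queued_packets, rate_control):
    """Apply the forwarding conditions for one packet buffer queue."""
    if send_buffer_full or queued_packets <= 0:
        return False                  # conditions 1) and 2) apply to every queue
    if is_uplink_queue:
        return rate_control.take()    # condition 3) applies to uplink queues only
    return True

rc = RateControl(interval=3)
decisions = []
for _ in range(6):
    rc.tick()
    decisions.append(may_forward(True, False, 1, rc))
downlink_ok = may_forward(False, False, 1, rc)   # no token needed for this case
```

With an interval of 3 ticks, an uplink queue that always has a packet waiting is allowed to forward on every third poll, which is how the token interval bounds the uplink rate.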

Ethernet protocol conversion gateway:

As shown in fig. 4, the apparatus mainly includes a network interface module (a downlink network interface module 401 and an uplink network interface module 402), a switching engine module 403, a CPU module 404, a packet detection module 405, a rate control module 408, an address table 406, a packet buffer 407, a MAC adding module 409, and a MAC deleting module 410.

A data packet coming from the downlink network interface module 401 enters the packet detection module 405. The packet detection module 405 checks whether the Ethernet MAC DA, the Ethernet MAC SA, the Ethernet length or frame type, the video network destination address DA, the video network source address SA, the video network packet type, and the packet length of the packet meet the requirements; if so, it allocates a corresponding stream identifier (stream-id), the MAC deletion module 410 then strips the MAC DA, MAC SA, and length or frame type (2 bytes), and the packet enters the corresponding receive buffer; otherwise, the packet is discarded.

The downlink network interface module 401 checks the send buffer of the port; if there is a packet, it obtains the Ethernet MAC DA of the corresponding terminal according to the video network destination address DA of the packet, adds the Ethernet MAC DA of the terminal, the MAC SA of the Ethernet protocol conversion gateway, and the Ethernet length or frame type, and sends the packet.

The other modules in the Ethernet protocol conversion gateway function similarly to those of the access switch.
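The MAC stripping and MAC adding steps amount to removing or prepending a 14-byte Ethernet header (6-byte MAC DA, 6-byte MAC SA, 2-byte length or frame type). A minimal sketch follows, in which the lookup table `mac_table`, the MAC addresses and the frame type value are hypothetical placeholders rather than values from the source.

```python
ETH_HEADER_LEN = 14            # 6-byte MAC DA + 6-byte MAC SA + 2-byte length/type
PLACEHOLDER_ETH_TYPE = 0x88B5  # placeholder frame type value, not from the source

def strip_ethernet_header(frame: bytes) -> bytes:
    """Remove MAC DA, MAC SA and length/type, leaving the video network packet."""
    if len(frame) < ETH_HEADER_LEN:
        raise ValueError("frame shorter than an Ethernet header")
    return frame[ETH_HEADER_LEN:]

def add_ethernet_header(packet: bytes, mac_table: dict, gateway_mac: bytes,
                        eth_type: int = PLACEHOLDER_ETH_TYPE) -> bytes:
    """Prepend an Ethernet header chosen from the packet's 8-byte video network DA."""
    terminal_mac = mac_table[packet[:8]]   # video network DA -> terminal MAC DA
    if len(terminal_mac) != 6 or len(gateway_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    return terminal_mac + gateway_mac + eth_type.to_bytes(2, "big") + packet

# Hypothetical addresses for illustration only.
terminal_da = bytes([0x10, 0, 0, 0, 0, 1, 0, 2])
mac_table = {terminal_da: bytes.fromhex("020000000001")}
gateway_mac = bytes.fromhex("020000000099")
packet = terminal_da + bytes(8) + b"\x00\x00" + b"audio"
frame = add_ethernet_header(packet, mac_table, gateway_mac)
restored = strip_ethernet_header(frame)
```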

Terminal:

The terminal mainly comprises a network interface module, a service processing module and a CPU module. For example, the set-top box mainly comprises a network interface module, a video and audio codec engine module and a CPU module; the coding board mainly comprises a network interface module, a video and audio coding engine module and a CPU module; the memory mainly comprises a network interface module, a CPU module and a disk array module.

1.3 The devices of the metropolitan area network part can be mainly classified into 3 types: node servers, node switches and metropolitan area servers. The node switch mainly comprises a network interface module, a switching engine module and a CPU module; the metropolitan area server mainly comprises a network interface module, a switching engine module and a CPU module.

2. Video networking packet definition

2.1 Access network packet definition

The data packet of the access network mainly comprises the following parts: destination Address (DA), Source Address (SA), reserved bytes, payload (pdu), CRC.

As shown in the following table, the data packet of the access network mainly includes the following parts:

DA | SA | Reserved | Payload | CRC

wherein:

the Destination Address (DA) is composed of 8 bytes (byte), the first byte represents the type of the data packet (such as various protocol packets, multicast data packets, unicast data packets, etc.), there are 256 possibilities at most, the second byte to the sixth byte are metropolitan area network addresses, and the seventh byte and the eighth byte are access network addresses;

the Source Address (SA) is also composed of 8 bytes (byte), defined as the same as the Destination Address (DA);

the reserved byte consists of 2 bytes;

the payload part has different lengths for different types of data packets: it is 64 bytes if the data packet is one of the various protocol packets, and 32 + 1024 = 1056 bytes if the data packet is a unicast data packet; of course, the length is not limited to these 2 cases;

the CRC consists of 4 bytes and is calculated in accordance with the standard ethernet CRC algorithm.
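The field layout above can be illustrated with a short packing and parsing sketch. The field widths come from the text; everything else is an assumption made for illustration: big-endian address fields, a CRC computed over all preceding fields and appended in little-endian order, and the example packet type value. `zlib.crc32` implements the same CRC-32 used by standard Ethernet.

```python
import zlib

def build_address(packet_type: int, metro_addr: int, access_addr: int) -> bytes:
    """8-byte address: 1 byte type, 5 bytes metro address, 2 bytes access address."""
    return (bytes([packet_type])
            + metro_addr.to_bytes(5, "big")      # byte order is an assumption
            + access_addr.to_bytes(2, "big"))

def build_packet(da: bytes, sa: bytes, payload: bytes,
                 reserved: bytes = b"\x00\x00") -> bytes:
    """Assemble DA | SA | Reserved | Payload | CRC."""
    body = da + sa + reserved + payload
    crc = zlib.crc32(body)                       # assumed to cover all prior fields
    return body + crc.to_bytes(4, "little")      # assumed CRC byte order

def parse_packet(raw: bytes) -> dict:
    """Split a packet into its fields and verify the CRC."""
    body, crc = raw[:-4], int.from_bytes(raw[-4:], "little")
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch")
    da = body[:8]
    return {
        "packet_type": da[0],
        "metro_addr": int.from_bytes(da[1:6], "big"),
        "access_addr": int.from_bytes(da[6:8], "big"),
        "sa": body[8:16],
        "reserved": body[16:18],
        "payload": body[18:],
    }

# Hypothetical unicast packet with a 1056-byte payload.
da = build_address(0x10, metro_addr=1, access_addr=2)
sa = build_address(0x10, metro_addr=1, access_addr=3)
packet = build_packet(da, sa, bytes(1056))
fields = parse_packet(packet)
```

Under these assumptions a unicast data packet occupies 8 + 8 + 2 + 1056 + 4 = 1078 bytes.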

2.2 metropolitan area network packet definition

The topology of a metropolitan area network is a graph, and there may be 2 or even more connections between two devices, for example between a node switch and a node server or between two node switches. However, the metropolitan area network address of a metropolitan area network device is unique, so in order to accurately describe the connection relationships between metropolitan area network devices, the embodiment of the present invention introduces a parameter: a label, which uniquely describes a metropolitan area network device.

In this specification, the definition of the label is similar to that of the label in MPLS (Multi-Protocol Label Switching). Assuming that there are two connections between device A and device B, there are 2 labels for packets from device A to device B, and 2 labels for packets from device B to device A. Labels are classified into incoming labels and outgoing labels: assuming that the label (incoming label) of a packet entering device A is 0x0000, the label (outgoing label) of the packet when it leaves device A may become 0x0001. The network access process of the metropolitan area network is performed under centralized control, that is, both address allocation and label allocation in the metropolitan area network are directed by the metropolitan area server, and the node switches and node servers only execute passively. This differs from MPLS, where label allocation is the result of mutual negotiation between the switch and the server.

As shown in the following table, the data packet of the metropolitan area network mainly includes the following parts:

DA | SA | Reserved | Label | Payload | CRC

Namely, destination address (DA), source address (SA), reserved bytes (Reserved), label, payload (PDU), and CRC. The format of the label may be defined as follows: the label is 32 bits, of which the upper 16 bits are reserved and only the lower 16 bits are used, and it is located between the reserved bytes and the payload of the packet.
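The label field and the incoming-to-outgoing label change can be sketched as follows. The 32-bit width, the reserved upper 16 bits and the position after the reserved bytes come from the text; the big-endian byte order, the function names and the per-device `label_table` are illustrative assumptions, and the CRC is left out for brevity.

```python
LABEL_OFFSET = 8 + 8 + 2   # label follows DA (8), SA (8) and reserved bytes (2)
LABEL_LEN = 4              # 32-bit label field

def encode_label(label: int) -> bytes:
    """Encode a 16-bit label into the 32-bit field, upper 16 bits left at zero."""
    if not 0 <= label <= 0xFFFF:
        raise ValueError("only the lower 16 bits of the label are used")
    return label.to_bytes(LABEL_LEN, "big")          # byte order is an assumption

def decode_label(field: bytes) -> int:
    """Read the label, ignoring the reserved upper 16 bits."""
    return int.from_bytes(field, "big") & 0xFFFF

def relabel(body: bytes, label_table: dict) -> bytes:
    """Replace the incoming label with the outgoing label (body excludes the CRC)."""
    end = LABEL_OFFSET + LABEL_LEN
    incoming = decode_label(body[LABEL_OFFSET:end])
    outgoing = label_table[incoming]                 # hypothetical per-device table
    return body[:LABEL_OFFSET] + encode_label(outgoing) + body[end:]

# The example from the text: incoming label 0x0000 leaves the device as 0x0001.
body_in = bytes(LABEL_OFFSET) + encode_label(0x0000) + b"data"
body_out = relabel(body_in, {0x0000: 0x0001})
```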

Based on the characteristics of the video networking, the echo cancellation scheme provided by the embodiment of the invention follows the protocol of the video networking, and utilizes a software method to cancel the echo in the video networking communication, so that the method is simple and convenient, and the user experience is improved.

Example one

The echo cancellation method of the embodiment of the invention can be applied to a video network. The video network may include video network terminals and a video network server, and a video network terminal must be registered with the video network server before it can carry out normal services. The video network server may be the node server described above. The video network terminal is a service endpoint device on the video network, an actual participant in or provider of video network services. The video network terminal can be a hardware terminal, such as a conference set-top box, a video telephone set-top box, an operation teaching set-top box, a streaming media gateway, a storage gateway, or a media synthesizer. The video network terminal can also be a software terminal, such as a Windows software terminal.

Referring to fig. 5, a flowchart illustrating steps of an echo cancellation method according to a first embodiment of the present invention is shown.

The echo cancellation method of the embodiment of the invention can comprise the following steps:

step 501, a first video network terminal receives first audio data sent by a video network server according to a first downlink communication link configured for the first video network terminal.

The echo cancellation method provided by the embodiment of the invention can be applied to video conference, video call and other two-way communication based on video networking. The video networking terminal described in the embodiment of the invention can be a Windows software terminal. The first video network terminal and the second video network terminal are two parties in communication, and the first video network terminal and the second video network terminal are opposite terminals. The first video network terminal and the second video network terminal can be connected with equipment such as a sound box and a microphone. The microphone is used for collecting external sound, and the sound box is used for playing the sound.

In the embodiment of the present invention, after the second video network terminal starts two-way communication such as a video conference or video call, it can collect external sound through the microphone. For example, when the user of the second video network terminal speaks, the terminal collects the user's voice through the microphone. The second video network terminal then encodes the collected sound to obtain the first audio data. For example, the sound may be encoded into first audio data in AAC (Advanced Audio Coding) format, in G711 format, and so on.

After encoding, the second video network terminal sends the first audio data to the video network server. In a specific implementation, the second video network terminal may encapsulate the first audio data into a first video networking protocol data packet based on a video networking protocol, and send the first video networking protocol data packet to the video network server through the video network.

The video networking protocol may be a protocol for processing audio data in video networking, such as the 2001 protocol. 2001 protocol may be specified as shown in the following table:

After receiving the first audio data sent by the second video network terminal, the video network server forwards it to the first video network terminal; that is, it forwards the received first video networking protocol data packet, in which the first audio data is encapsulated, to the first video network terminal.

In the embodiment of the present invention, the video network server may send the first audio data to the first video network terminal according to the first downlink communication link configured for the first video network terminal.

In practical applications, the video network is a network with centralized control. It includes a master control server and lower-level network devices, the latter including terminals. One of the core concepts of the video network is that the master control server notifies the switching devices to configure tables for the downlink communication link of the current service, and data packets are then transmitted based on the configured tables.

Namely, the communication method in the video network includes:

The master control server configures the downlink communication link of the current service.

The data packet of the current service sent by the source terminal (such as the second video network terminal) is transmitted to the target terminal (such as the first video network terminal) according to the downlink communication link.

In the embodiment of the present invention, configuring the downlink communication link of the current service includes: notifying the switching devices involved in the downlink communication link of the current service to configure their tables.

Further, transmitting according to the downlink communication link includes: the switching device consults the configured table and forwards the received data packet through the corresponding port.

In particular implementations, the services include unicast communication services and multicast communication services. That is, for both multicast and unicast communication, the core concept of configuring tables and then using them can be adopted to implement communication in the video network.

As mentioned above, the video network includes an access network portion, in which the master control server is a node server and the lower-level network devices include access switches and terminals.

For a unicast communication service in the access network, the step of the master control server configuring the downlink communication link of the current service may include the following sub-steps:

Sub-step S11, the master control server obtains the downlink communication link information of the current service according to the service request protocol packet initiated by the source terminal, where the downlink communication link information includes the downlink communication port information of the master control server and of the access switches participating in the current service.

Sub-step S12, the master control server sets, in its internal packet address table, the downlink port to which packets of the current service are directed according to the downlink communication port information of the master control server, and sends a port configuration command to the corresponding access switch according to the downlink communication port information of that access switch.

Sub-step S13, the access switch sets, in its internal packet address table, the downlink port to which packets of the current service are directed according to the port configuration command.
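The unicast configuration in sub-steps S11 to S13, together with the table lookup used when forwarding, can be sketched roughly as follows. This is only an illustrative Python model: the class names `MasterServer` and `AccessSwitch`, the method names, and the address and port values are assumptions made for this sketch and are not taken from the video networking protocol.

```python
class AccessSwitch:
    """Hypothetical access switch with an internal packet address table."""

    def __init__(self, name):
        self.name = name
        self.address_table = {}  # destination address -> downlink port

    def on_port_config(self, dest_addr, port):
        # S13: set the downlink port for packets of the current service.
        self.address_table[dest_addr] = port

    def forward(self, dest_addr):
        # Consult the configured table; None means no link is configured.
        return self.address_table.get(dest_addr)


class MasterServer:
    """Hypothetical master control (node) server."""

    def __init__(self):
        self.address_table = {}  # destination address -> downlink port

    def configure_unicast(self, dest_addr, server_port, switch_ports):
        # S11 is assumed to be done already: the link information derived
        # from the service request is passed in as arguments.
        # S12: set the server's own downlink port ...
        self.address_table[dest_addr] = server_port
        # ... and send a port configuration command to each access switch.
        for switch, port in switch_ports:
            switch.on_port_config(dest_addr, port)


switch = AccessSwitch("access-switch-1")
server = MasterServer()
server.configure_unicast("terminal-1", server_port=2, switch_ports=[(switch, 5)])
print(server.address_table["terminal-1"], switch.forward("terminal-1"))
```

Once the tables are configured, forwarding a packet reduces to a dictionary lookup on each device, which is the point of configuring the link before any service data is sent.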

For a multicast communication service (e.g., video conference) in the access network, the step of the master control server obtaining downlink information of the current service may include the following sub-steps:

Sub-step S21, the master control server obtains a service request protocol packet initiated by the target terminal to apply for the multicast communication service, where the service request protocol packet includes service type information, service content information, and the access network address of the target terminal.

The service content information includes a service number.

Sub-step S22, the master control server extracts the access network address of the source terminal from a preset content-address mapping table according to the service number.

Sub-step S23, the master control server obtains the multicast address corresponding to the source terminal and distributes it to the target terminal, and acquires the communication link information of the current multicast service according to the service type information and the access network addresses of the source terminal and the target terminal.
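As a rough illustration of sub-steps S21 to S23, the lookup chain from service number to source address to multicast address might look like the Python sketch below. The function name, the dictionary layout of the request, and all addresses are hypothetical; the actual packet format is not given in this text.

```python
def resolve_multicast_link(request, content_address_map, multicast_address_of):
    """Return link information for a multicast service request (S21 to S23)."""
    # S21: the request carries service type, service content and target address.
    service_number = request["service_content"]["service_number"]
    target_addr = request["target_address"]
    # S22: look up the source terminal's access network address by service number.
    source_addr = content_address_map[service_number]
    # S23: obtain the multicast address of the source terminal for the target.
    multicast_addr = multicast_address_of[source_addr]
    return {
        "service_type": request["service_type"],
        "source": source_addr,
        "target": target_addr,
        "multicast_address": multicast_addr,
    }


# Hypothetical example data.
request = {
    "service_type": "video_conference",
    "service_content": {"service_number": 7},
    "target_address": "terminal-1",
}
link = resolve_multicast_link(
    request,
    content_address_map={7: "terminal-2"},
    multicast_address_of={"terminal-2": "mcast-0x10"},
)
print(link)
```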

Step 502, the first video network terminal stores the first audio data in a preset play queue and plays the first audio data.

The first video network terminal receives the first video networking protocol data packet sent by the video network server and parses it to obtain the first audio data it carries.

The first video network terminal is preset with a play queue, and the play queue is used for storing audio data received by the first video network terminal and collected by the second video network terminal. Therefore, after receiving the first audio data, the first video network terminal stores the first audio data into a preset play queue. The first audio data stored in the play queue is used as reference data in a subsequent echo cancellation process.

The first video network terminal decodes the first audio data using the decoding mode that corresponds to the encoding mode of the second video network terminal, obtaining first audio data suitable for playing. After decoding, the first video network terminal plays the decoded first audio data through the sound box.

Step 503, the first video networking terminal collects the second audio data, extracts the first audio data from the play queue, and performs echo cancellation on the second audio data by using the first audio data to obtain third audio data.

In the embodiment of the invention, after the first video network terminal starts two-way communication such as a video conference or video call, it can collect external sound through the microphone. For example, when the first video network terminal plays the first audio data through the sound box, the microphone may pick up the played sound; when the user of the first video network terminal speaks, the microphone picks up the user's voice. The first video network terminal encodes the collected sound to obtain the second audio data. Since audio collection may run continuously, if the microphone picks up no external sound, for example because the user is not speaking and no first audio data is being played, mute data may be filled in as the second audio data.

The second audio data collected by the first video network terminal may be mixed audio data that includes both the first audio data played through the sound box and other audio data (such as the user's speech). The first audio data contained in the mixed audio data is the echo. Therefore, before the second audio data is sent to the second video network terminal, echo cancellation may be performed so that the first audio data in the mixed audio data is removed before sending. Specifically, the first video network terminal extracts the first audio data from the play queue and uses it to perform echo cancellation on the second audio data, obtaining the third audio data.

In practical applications, the first video network terminal may call the Speex library to implement acoustic echo cancellation (AEC). Speex is an open-source speech codec library. The Speex project aims to lower the barrier to entry for speech applications by providing an alternative to high-performance speech codecs; it is well suited to network applications and has its own advantages over other codecs. Specifically, echo cancellation may estimate the echo signal with an adaptive method and then subtract the estimate from the collected signal, thereby cancelling the echo. In this way the first video network terminal removes from the second audio data the part that matches the first audio data and retains the rest. The retained data is the third audio data, that is, the audio data after echo cancellation.
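To make the adaptive estimate-and-subtract idea concrete, the following Python sketch uses a plain normalized least-mean-squares (NLMS) filter. It is a stand-in for illustration only: it does not call Speex and is not claimed to be the algorithm Speex uses. The function name, the tap count, the step size, and the synthetic two-tap echo path are all assumptions of this sketch.

```python
import random


def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Remove the echo of `ref` contained in `mic` with an NLMS adaptive filter.

    mic: captured samples (second audio data, echo plus near-end sound)
    ref: played samples (first audio data, used as reference)
    Returns the residual samples (third audio data).
    """
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    out = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                       # subtract estimated echo
        norm = sum(h * h for h in history) + eps    # input power, regularized
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        out.append(e)
    return out


# Synthetic demo: the microphone hears only an echo of the far-end signal
# through a hypothetical two-tap echo path (0.6 direct, 0.3 delayed by one).
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
captured = [0.6 * far_end[n] + (0.3 * far_end[n - 1] if n > 0 else 0.0)
            for n in range(len(far_end))]
residual = nlms_echo_cancel(captured, far_end)

echo_energy = sum(v * v for v in captured[-200:])
residual_energy = sum(v * v for v in residual[-200:])
print(residual_energy < echo_energy)
```

The filter only works if the reference samples line up in time with the echo in the captured samples, which is why the alignment and queue-length measures described in this embodiment matter.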

In a preferred embodiment, echo cancellation may be skipped for the initial period in which the first video network terminal has started collecting second audio data but may not yet have received the first audio data collected and sent by the second video network terminal. Thus, step 503 may include: the first video network terminal collects second audio data and positions to a location in the second audio data delayed by a set frame number; the first video network terminal extracts the first audio data from the play queue and uses it to perform echo cancellation on the second audio data starting from that position, obtaining the third audio data.

This approach mainly addresses alignment of the first frame. After the first video network terminal starts collecting second audio data, it first skips a set number of frames of second audio data. During this period the terminal can be assumed not to have received, and therefore not to be playing, the first audio data sent by the second video network terminal, so the collected second audio data does not contain the first audio data and no echo cancellation is performed on the skipped frames. From the second audio data collected after the skip onward, the terminal can be assumed to have started receiving and playing the first audio data, so the collected second audio data contains the first audio data and echo cancellation starts from that point.

For the set frame number, a person skilled in the art may select any suitable value according to practical experience; the embodiment of the present invention is not limited in this respect. For example, the set frame number may be 30 frames. With each frame lasting 10 milliseconds, 30 frames total 300 milliseconds. Thus, once collection of the second audio data begins, 30 frames (300 milliseconds) are skipped, during which no echo cancellation is performed on the collected second audio data; echo cancellation starts from the 31st frame, that is, after 300 milliseconds.
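The first-frame alignment in this example can be sketched in Python as below. The constants follow the 30-frame, 10-millisecond example; the function name and the `cancel` callback are hypothetical. The sketch also assumes that skipped frames are passed on unprocessed, which the text does not state explicitly.

```python
FRAME_MS = 10      # frame duration from the example
SKIP_FRAMES = 30   # set frame number from the example


def process_captured(frames, cancel, skip_frames=SKIP_FRAMES):
    """Leave the first `skip_frames` captured frames alone, cancel echo after."""
    out = []
    for index, frame in enumerate(frames):
        if index < skip_frames:
            # Nothing is being played yet, so there is no echo to remove.
            # Assumption: the frame is forwarded unprocessed.
            out.append(frame)
        else:
            # Echo cancellation starts at frame 31 (index 30).
            out.append(cancel(frame))
    return out


# Hypothetical stand-in for the real canceller: it just tags the frame.
captured = ["f%d" % i for i in range(32)]
result = process_captured(captured, cancel=lambda frame: "aec:" + frame)
print(SKIP_FRAMES * FRAME_MS, result[29], result[30])
```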

With this first-frame alignment, echo cancellation is not executed while the first video network terminal is not yet playing the first audio data. This avoids the situation in which the mixed data and the reference data cannot be accurately aligned during echo cancellation, which would degrade the echo cancellation effect.

In a preferred embodiment, the first video network terminal may further dynamically adjust the number of frames in the play queue to keep it within a certain range. Therefore, the first video network terminal can judge whether the frame number stored in the play queue is within the range of the set frame number; and if the frame number stored in the play queue is judged not to be in the set frame number range, adjusting the frame number stored in the play queue to be in the set frame number range.

Specifically, the step of adjusting the frame number of the first audio data to be within the set frame number range may include: if the frame number stored in the play queue is smaller than the minimum frame number of the set frame number range, adding mute data into the play queue to enable the frame number stored in the play queue to be within the set frame number range; if the frame number stored in the play queue is larger than the maximum frame number in the set frame number range, deleting part of the first audio data from the play queue to enable the frame number stored in the play queue to be in the set frame number range.

For the set frame number range, a person skilled in the art may select any suitable value according to practical experience; the embodiment of the present invention is not limited in this respect. For example, the set frame number range may be 5-15 frames; with each frame lasting 10 milliseconds, the corresponding time range is 50-150 milliseconds. If the number of frames stored in the play queue is less than 5, for example 4, mute data is added at the rear end of the play queue so that the number of stored frames falls within 5-15 frames; for example, at least 1 frame of mute data can be added. If the number of frames stored in the play queue is greater than 15, for example 16, part of the first audio data is deleted from the front end of the play queue so that the number of stored frames falls within 5-15 frames; for example, at least 1 frame of the first audio data can be deleted.
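A minimal Python sketch of this dynamic adjustment is given below, using the 5-15 frame range from the example. The function name is hypothetical, and the frame size of 320 bytes assumes 10 ms of 16 kHz, 16-bit mono PCM, which is an assumption of this sketch and not stated in the text.

```python
from collections import deque

MIN_FRAMES, MAX_FRAMES = 5, 15   # set frame number range from the example
FRAME_BYTES = 320                # assumption: 10 ms of 16 kHz 16-bit mono PCM
SILENCE = bytes(FRAME_BYTES)     # one frame of mute data


def adjust_play_queue(play_queue, min_frames=MIN_FRAMES, max_frames=MAX_FRAMES):
    """Keep the number of queued reference frames within the set range."""
    while len(play_queue) < min_frames:
        play_queue.append(SILENCE)   # pad with mute data at the rear end
    while len(play_queue) > max_frames:
        play_queue.popleft()         # delete the oldest frame at the front end
    return play_queue


short_queue = adjust_play_queue(deque([b"\x01" * FRAME_BYTES] * 4))
long_queue = adjust_play_queue(deque(bytes([i]) * FRAME_BYTES for i in range(16)))
print(len(short_queue), len(long_queue))
```

Padding at the rear and trimming at the front keeps the frames that remain in their original order, so the next reference frame taken from the front is still the oldest one available.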

By the dynamic adjustment mode, the frame number of the play queue can be kept in a fixed range all the time, so that the phenomenon that the frame number of the play queue is unstable and the echo cancellation process is influenced due to network blockage, busy CPU scheduling, process priority influence and other factors is avoided, and the stability and the accuracy of the echo cancellation are further improved.

Step 504, the first video network terminal sends the third audio data to the video network server based on the video networking protocol, and the video network server sends the third audio data to the second video network terminal according to a second downlink communication link configured for the second video network terminal.

After performing echo cancellation on the second audio data to obtain the third audio data, the first video network terminal sends the third audio data to the video network server. In a specific implementation, the first video network terminal may encapsulate the third audio data into a second video networking protocol data packet based on a video networking protocol, and send the second video networking protocol data packet to the video network server through the video network. The video networking protocol may be a protocol for processing audio data in video networking, such as the 2001 protocol.

After receiving the third audio data sent by the first video network terminal, the video network server forwards it to the second video network terminal; that is, it forwards the received second video networking protocol data packet, in which the third audio data is encapsulated, to the second video network terminal.

In the embodiment of the present invention, the video networking server may send the third audio data to the second video networking terminal according to the second downlink communication link configured for the second video networking terminal.

The second video network terminal receives and parses the second video networking protocol data packet sent by the video network server to obtain the third audio data it carries. The second video network terminal decodes the third audio data using the decoding mode that corresponds to the encoding mode of the first video network terminal, obtaining third audio data suitable for playing. After decoding, the second video network terminal plays the decoded third audio data through the sound box. Since the third audio data has had the echo removed, when it is played the user of the second video network terminal will not hear their own earlier speech, that is, the first audio data previously sent to the first video network terminal.

It should be noted that, because the first video network terminal and the second video network terminal are opposite terminals, the second video network terminal may also execute the echo cancellation process executed by the first terminal, that is, the process similar to step 501 to step 504, which is not discussed in detail in this embodiment of the present invention.

To summarize, the echo cancellation method of embodiments of the present invention may comprise two separate threads.

One is a play thread, which is mainly used at the near end to play the audio data collected at the far end and delivered by the video network server. The play thread corresponds to step 501 and step 502: the first video network terminal receives the first audio data, obtains it (in real time or at regular intervals) through a Windows driver callback for playing, and stores it in the play queue.

The other is an echo cancellation thread, which is mainly used at the near end to collect audio data, cancel the echo contained in it, and send the echo-cancelled audio data to the far end. The echo cancellation thread corresponds to step 503 and step 504. The first video network terminal acquires the second audio data (in real time or at regular intervals) through a Windows driver callback and extracts the first audio data from the play queue as reference data, taking one frame of first audio data from the front end of the play queue each time, and performs echo cancellation on the second audio data using the extracted first audio data. First-frame alignment is used to skip the frames collected at the beginning, before any first audio data has been received, and dynamic adjustment is used to delete some stored frames from the play queue or pad it with mute data. The first video network terminal then sends the third audio data obtained from echo cancellation to the second video network terminal.

The two threads are independently executed without mutual interference, and the core process of echo cancellation is basically completed through the echo cancellation thread, so that the process is simpler and more convenient.
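The two-thread structure can be sketched in Python as follows. This is a simplified model under several assumptions: frames are short lists of integers, the echo path is an identity (the microphone hears exactly what was played), and sample-wise subtraction stands in for the real echo canceller. The function names and the `sent` list are hypothetical, and the Windows driver callbacks are replaced by plain loops.

```python
import queue
import threading

play_queue = queue.Queue()   # reference frames shared by the two threads
sent = []                    # stands in for frames sent to the far end


def play_thread(received_frames):
    """Steps 501 and 502: store each received frame, then play it."""
    for frame in received_frames:
        play_queue.put(frame)   # keep the frame as echo reference
        # A real terminal would also decode and play the frame here.


def cancel_thread(captured_frames):
    """Steps 503 and 504: cancel echo in each captured frame and send it."""
    for mic in captured_frames:
        # Take one reference frame from the front end of the play queue.
        ref = play_queue.get(timeout=1.0)
        # Stand-in for the echo canceller: subtract the reference sample-wise.
        sent.append([m - r for m, r in zip(mic, ref)])


far_end = [[1, 2], [3, 4]]               # first audio data (hypothetical)
near_speech = [[10, 10], [20, 20]]       # what the near-end user says
captured = [[s + e for s, e in zip(speech, echo)]
            for speech, echo in zip(near_speech, far_end)]

player = threading.Thread(target=play_thread, args=(far_end,))
canceller = threading.Thread(target=cancel_thread, args=(captured,))
player.start()
canceller.start()
player.join()
canceller.join()
print(sent)
```

The play queue is the only state the two threads share, so each thread can run at the pace of its own audio callback.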

In the embodiment of the invention, the echo in the two-way communication can be eliminated in a software mode, the method is simple and convenient, and the cost is lower; and the data transmission process is based on a video networking protocol, so that the transmission is faster, and the real-time performance of communication is improved.

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.

Example two

Referring to fig. 6, a block diagram of an echo cancellation device according to a second embodiment of the present invention is shown. The device can be applied to a first video network terminal of a video network. It should be noted that each video network terminal in the video network may be used as the first video network terminal.

The echo cancellation device of the embodiment of the present invention may include the following modules:

a receiving module 601, configured to receive first audio data sent by a video networking server according to a first downlink communication link configured for the first video networking terminal; the first audio data is collected by a second video network terminal and is sent to the video network server;

a playing module 602, configured to store the first audio data in a preset playing queue, and play the first audio data;

a cancellation module 603, configured to collect second audio data, extract the first audio data from the play queue, and perform echo cancellation on the second audio data by using the first audio data to obtain third audio data;

a sending module 604, configured to send the third audio data to the video networking server based on a video networking protocol, and send the third audio data to the second video networking terminal by the video networking server according to a second downlink communication link configured for the second video networking terminal.

In a preferred embodiment, the cancellation module comprises: a data positioning unit, configured to collect second audio data and position to a location in the second audio data delayed by a set frame number; and an echo cancellation unit, configured to extract first audio data from the play queue and perform echo cancellation on the second audio data starting from that position by using the first audio data to obtain third audio data.

In a preferred embodiment, the apparatus further comprises: and the adjusting module is used for adjusting the frame number stored in the play queue to the set frame number range if the frame number stored in the play queue is judged not to be in the set frame number range.

In a preferred embodiment, the adjusting module comprises: a first adjusting unit, configured to add mute data to the play queue if the number of frames stored in the play queue is smaller than the minimum number of frames in the set frame number range, so that the number of frames stored in the play queue is within the set frame number range; and the second adjusting unit is used for deleting part of the first audio data from the play queue if the number of frames stored in the play queue is greater than the maximum number of frames in the set frame number range, so that the number of frames stored in the play queue is in the set frame number range.

In a preferred embodiment, the set frame number ranges from 5 to 15 frames.

Since the device embodiment is basically similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The echo cancellation method and the echo cancellation device provided by the present invention are described in detail above, and the principle and the implementation of the present invention are explained in the present document by applying specific examples, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. An echo cancellation method applied to a video network, the method comprising:

if the first video network terminal judges that the frame number stored in the play queue is not in the set frame number range, adjusting the frame number stored in the play queue to be in the set frame number range to obtain adjusted first audio data;

the first video networking terminal collects second audio data, extracts the first audio data with the frame number within a set frame number range from the play queue, and performs echo cancellation on the second audio data by using the first audio data with the frame number within the set frame number range to obtain third audio data;

2. The method according to claim 1, wherein the step of the first video network terminal acquiring second audio data, extracting the first audio data from the play queue, and performing echo cancellation on the second audio data by using the first audio data to obtain third audio data comprises:

the first video network terminal collects second audio data and positions the second audio data to a position with a set frame number delayed;

and the first video networking terminal extracts first audio data from the play queue, and performs echo cancellation on second audio data starting from the position by using the first audio data to obtain third audio data.

3. The method of claim 1, wherein the step of adjusting the frame number of the first audio data to be within the set frame number range comprises:

if the frame number stored in the play queue is smaller than the minimum frame number of the set frame number range, adding mute data into the play queue to enable the frame number stored in the play queue to be within the set frame number range;

if the frame number stored in the play queue is larger than the maximum frame number in the set frame number range, deleting part of the first audio data from the play queue to enable the frame number stored in the play queue to be in the set frame number range.

4. The method of claim 1, wherein the set frame number is in a range of 5 to 15 frames.

5. An echo cancellation device, wherein the device is applied to a first video network terminal of a video network, the device comprising:

the adjusting module is used for adjusting the frame number stored in the play queue to be within the set frame number range if the frame number stored in the play queue is judged not to be within the set frame number range;

the cancellation module is used for collecting second audio data, extracting the first audio data with the frame number within a set frame number range from the play queue, and performing echo cancellation on the second audio data by using the first audio data with the frame number within the set frame number range to obtain third audio data;

6. The apparatus of claim 5, wherein the cancellation module comprises:

the data positioning unit is used for collecting second audio data and positioning the second audio data to a position delayed by a set frame number;

and the echo cancellation unit is used for extracting first audio data from the play queue, and performing echo cancellation on second audio data starting from the position by using the first audio data to obtain third audio data.

7. The apparatus of claim 5, wherein the adjustment module comprises:

a first adjusting unit, configured to add mute data to the play queue if the number of frames stored in the play queue is smaller than the minimum number of frames in the set frame number range, so that the number of frames stored in the play queue is within the set frame number range;

and the second adjusting unit is used for deleting part of the first audio data from the play queue if the number of frames stored in the play queue is greater than the maximum number of frames in the set frame number range, so that the number of frames stored in the play queue is in the set frame number range.

8. The apparatus of claim 5, wherein the set frame number is in a range of 5-15 frames.