CN113707160A - Echo delay determination method, device, equipment and storage medium - Google Patents

Echo delay determination method, device, equipment and storage medium

Info

Publication number
CN113707160A
CN113707160A
Authority
CN
China
Prior art keywords
audio signal
watermark information
watermark
frame
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110246487.1A
Other languages
Chinese (zh)
Inventor
梁启仍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology (Shenzhen) Co., Ltd.
Original Assignee
Tencent Technology (Shenzhen) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202110246487.1A
Publication of CN113707160A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/018: Audio watermarking, i.e. embedding inaudible data in the audio signal
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L2021/02082: Noise filtering where the noise is echo or reverberation of the speech

Abstract

The embodiment of the application discloses a method, an apparatus, a device, and a storage medium for determining echo delay. The method comprises: embedding watermark information in a reference audio signal to be played to obtain a target audio signal; playing the target audio signal while collecting a near-end audio signal; performing watermark-information parsing on the near-end audio signal; and, when the watermark information is parsed out of the near-end audio signal, determining the echo delay from the position of the watermark information in the target audio signal and its position in the near-end audio signal. The method determines the echo delay accurately, which in turn helps improve the echo cancellation effect; in addition, by designing the embedding structure of the watermark information, the transmission quality of the watermark embedded in the audio signal can be preserved even under strong attack.

Description

Echo delay determination method, device, equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining echo delay.
Background
Echo is common in Real-Time Communication (RTC) scenarios. FIG. 1 is a schematic diagram illustrating how echo arises in an RTC scenario. As shown in fig. 1, when user A and user B hold a real-time voice call, voice A uttered by user A is captured by terminal device 110 and transmitted to terminal device 120 over the network. After receiving voice A, terminal device 120 plays it; it then captures both voice B uttered by user B and the voice A it has just played, and transmits them over the network to terminal device 110 for playback. At this point user A hears the voice A that he or she uttered earlier, which is an echo. The presence of echo interferes with voice call quality and degrades speech intelligibility.
To prevent echo from degrading voice call quality, Acoustic Echo Cancellation (AEC) technology was developed. When a terminal device cancels echo based on AEC, the audio signal sent by the peer device and awaiting playback is treated as the far-end audio signal, the audio signal captured by the terminal device and to be sent to the peer device is treated as the near-end audio signal, and the echo in the near-end audio signal is filtered out using the far-end audio signal as a reference.
Because echo formation usually passes through three stages, namely audio playback, air propagation, and audio capture, the echo contained in the near-end audio signal lags behind the far-end audio signal; this lag is the echo delay. When a terminal device cancels the echo in the near-end audio signal based on AEC, it usually must first align the near-end audio signal with the far-end audio signal using the echo delay, and then cancel the echo in the near-end audio signal according to the far-end audio signal. Echo delay determination therefore serves as a preprocessing step for echo cancellation, and the accuracy of the determined echo delay largely governs the echo cancellation effect.
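The alignment step described above can be sketched in a few lines. `align` is a hypothetical helper for illustration only, assuming the echo delay is already known in samples (real AEC pipelines typically operate on frames rather than raw samples):

```python
import numpy as np

def align(near, far, echo_delay):
    """Drop the first `echo_delay` samples of the near-end signal so its
    echo component lines up with the far-end reference (illustrative)."""
    aligned = near[echo_delay:]
    n = min(len(aligned), len(far))
    return aligned[:n], far[:n]

far = np.arange(10.0)
near = np.concatenate([np.zeros(4), far])  # echo lags the far end by 4 samples
aligned_near, aligned_far = align(near, far, 4)
print(np.array_equal(aligned_near, aligned_far))  # True
```

After alignment, the adaptive filter sees the echo and its far-end reference at matching time indices, which is the precondition for the filtering stages described below.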
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for determining echo delay, which can accurately determine the echo delay, thereby being beneficial to improving the echo cancellation effect.
In view of the above, a first aspect of the present application provides an echo delay determining method, including:
embedding watermark information in a reference audio signal to be played to obtain a target audio signal;
playing the target audio signal; and collecting near-end audio signals;
carrying out watermark information analysis processing on the near-end audio signal;
and under the condition that the watermark information is analyzed from the near-end audio signal through the watermark information analysis processing, determining echo delay according to the position of the watermark information in the target audio signal and the position of the watermark information in the near-end audio signal.
A second aspect of the present application provides an echo delay determination apparatus, comprising:
the watermark embedding module is used for embedding watermark information into a reference audio signal to be played to obtain a target audio signal;
the audio playing module is used for playing the target audio signal;
the audio acquisition module is used for acquiring a near-end audio signal;
the watermark analyzing module is used for analyzing the watermark information of the near-end audio signal;
an echo delay determining module, configured to determine an echo delay according to a position of the watermark information in the target audio signal and a position of the watermark information in the near-end audio signal when the watermark information is analyzed from the near-end audio signal through the watermark information analysis processing.
A third aspect of the present application provides an electronic device comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is adapted to perform the steps of the echo delay determination method according to the first aspect as described above, according to the computer program.
A fourth aspect of the present application provides a computer-readable storage medium for storing a computer program for executing the steps of the echo delay determination method of the first aspect.
A fifth aspect of the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps of the echo delay determination method according to the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
the embodiment of the application provides an echo delay determination method which innovatively applies an audio watermarking technology to the determination of the echo delay. During specific implementation, embedding inaudible watermark information of human ears in a reference audio signal to be played to obtain a target audio signal; then, playing the target audio signal and collecting a near-end audio signal; further, analyzing and processing watermark information of the collected near-end audio signal; in the case where the watermark information embedded up to that time is analyzed from the near-end audio signal by the watermark information analysis processing, the time lag of the echo in the near-end audio signal with respect to the target audio signal, that is, the echo delay is determined based on the position of the watermark information in the target audio signal and the position of the watermark information in the near-end audio signal. On one hand, the echo delay is determined based on the watermark information in the near-end audio signal, and no special requirement is made on the signal-to-noise ratio of the near-end audio signal, so that even under the condition that various audio signals are mixed in the near-end audio signal, the echo delay can be accurately determined by the method provided by the embodiment of the application. On the other hand, for the device for executing the method provided by the embodiment of the application, the echo delay can be accurately determined without consuming a large amount of computing resources. On the other hand, the method provided by the embodiment of the application has better compatibility and universality for different hardware devices and software applications, namely for different hardware devices and software applications, the echo delay can be accurately determined by the method provided by the embodiment of the application.
Drawings
FIG. 1 is a schematic diagram illustrating the generation of echo in an RTC scenario;
FIG. 2 is a schematic diagram of the operation principle of an echo cancellation module in communication software;
FIG. 3 is a schematic diagram illustrating an implementation principle of aligning a far-end audio signal and a near-end audio signal;
FIG. 4 is a schematic diagram illustrating an implementation of an echo cancellation module for canceling echo;
fig. 5 is a schematic view of an application scenario of the echo delay determination method according to the embodiment of the present application;
fig. 6 is a schematic flowchart of an echo delay determination method according to an embodiment of the present application;
fig. 7 is a schematic flowchart of generating a target audio signal according to an embodiment of the present application;
fig. 8 is a schematic diagram of a frame structure of a watermark source coding frame provided in an embodiment of the present application;
fig. 9 is a schematic diagram of a frame structure of a channel coding frame according to an embodiment of the present application;
fig. 10 is a schematic flowchart of watermark information parsing processing provided in an embodiment of the present application;
fig. 11 is a schematic diagram illustrating an implementation principle of watermark information injection at a play end according to an embodiment of the present application;
fig. 12 is a schematic diagram illustrating an implementation principle of analysis of watermark information at a recording end according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a first echo delay determination device according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a second echo delay determination device according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a third echo delay determination device according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of a fourth echo delay determination device according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is an interdisciplinary field covering a wide range of technologies at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Key technologies of Speech Technology are Automatic Speech Recognition (ASR), Text-To-Speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising interaction modes.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to an artificial intelligence voice technology, and is specifically explained by the following embodiment:
in order to cancel echo in the collected near-end audio signal, an echo cancellation module is usually disposed in both hardware devices and software applications having a real-time communication function. FIG. 2 is a schematic diagram of the operation principle of an echo cancellation module in certain communication software; as shown in fig. 2, when a user performs a real-time voice call through communication software, a far-end audio signal received by the communication software is played through a broadcasting module frame, and after the far-end audio signal is played, the communication software may acquire the played far-end audio signal when acquiring a near-end audio signal through a recording module frame, where the acquired far-end audio signal is an echo; furthermore, the echo cancellation module in the communication software cancels the echo in the near-end audio signal based on the far-end audio signal previously received by the communication software through the echo cancellation kernel, and then sends out the near-end audio signal with the echo cancelled.
As the operation diagram in fig. 2 shows, echo formation for the far-end audio signal passes through three stages in turn: the playback module framework (including software and hardware channels), acoustic propagation through the air, and the recording module framework (including software and hardware channels). As a result, the echo in the near-end audio signal lags behind the far-end audio signal; this lag is the echo delay, expressed by formula (1):
echo delay = playback delay + broadcast delay + record delay    (1)
the playback delay is used from the time when the far-end audio signal is restored to the time when the far-end audio signal is played by an audio playing device (such as a loudspeaker, etc.), and the playback delay usually has a great difference for different operating systems; generally, the playback delay of the Android system is 100-300ms, and the playback delay of the IOS system is 50-80 ms.
The broadcast delay is determined by the length of the physical path along which the audio signal propagates from the audio playback device (such as a speaker) through the air to the audio capture device (such as a microphone); because the playback and capture devices of a terminal device are usually close together, this delay can generally be ignored.
The record delay is the time taken from when the audio capture device picks up the near-end audio signal (which includes the played far-end audio signal) to when that signal reaches the echo cancellation module, usually about 10 ms.
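As a worked example of formula (1), using hypothetical figures drawn from the typical ranges stated above (an assumed illustration, not a measurement):

```python
# Hypothetical figures from the typical ranges above (assumed, not measured).
playback_delay_ms = 200   # e.g. an Android device (typical range 100-300 ms)
broadcast_delay_ms = 0    # air propagation, usually negligible
record_delay_ms = 10      # capture path to the echo cancellation module

# Formula (1): echo delay = playback delay + broadcast delay + record delay
echo_delay_ms = playback_delay_ms + broadcast_delay_ms + record_delay_ms
print(echo_delay_ms)      # 210
```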
Echo delay determination (also known as echo delay estimation) is a preprocessing technique for echo cancellation that determines the time difference between the echo in the near-end audio signal and the far-end audio signal. In operation, the echo cancellation module first aligns the far-end and near-end audio signals based on the determined echo delay, as shown in fig. 3; it then applies adaptive filtering and nonlinear processing to the near-end audio signal according to the far-end audio signal to filter out the echo, as shown in fig. 4. In many cases, whether the echo delay is determined accurately significantly affects the performance of the echo cancellation module, i.e., the effectiveness of echo cancellation.
In the related art, the echo delay is currently determined mainly by the following three implementations:
in a first implementation manner, a near-end audio signal and a far-end audio signal are respectively transformed to a frequency domain to obtain a near-end frequency spectrum and a far-end frequency spectrum; then, respectively carrying out binary processing on the near-end frequency spectrum and the far-end frequency spectrum to obtain a near-end binary spectrum and a far-end binary spectrum; the echo delay is then estimated by comparing the near-end binary spectrum with the far-end binary spectrum. Due to the fact that spectrum energy binarization is needed, the requirement for the signal-to-noise ratio of the near-end audio signal is high, performance in a double-talk scene (namely, multiple audio signals are mixed in a voice acquisition environment, so that the multiple audio signals are mixed in the acquired near-end audio signal) is poor, and the accuracy of the estimated echo delay is low.
In a second implementation, the echo delay is determined by the Generalized Cross-Correlation (GCC) method. The basic principle is to obtain the cross power spectrum between the near-end and far-end audio signals, apply different weights in the frequency domain, and finally inverse-transform to the time domain to obtain the cross-correlation function between the two signals; the lag at which the cross-correlation function peaks is the echo delay. This implementation performs better than the first, but determining the echo delay this way requires a large number of domain transforms and cross-correlation operations; the computation load is heavy for a terminal device, and the demand on its computing power is high.
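The generalized cross-correlation approach can be sketched as follows. PHAT weighting is used here as one common choice of frequency-domain weight (an assumption; the text does not fix the weighting):

```python
import numpy as np

def gcc_phat_delay(near, far):
    """Estimate echo delay (in samples) as the peak lag of the generalized
    cross-correlation between the near-end and far-end signals."""
    n = len(near) + len(far)
    cross = np.fft.rfft(near, n) * np.conj(np.fft.rfft(far, n))  # cross power spectrum
    cross /= np.maximum(np.abs(cross), 1e-12)   # PHAT: keep phase, whiten magnitude
    corr = np.fft.irfft(cross, n)               # inverse-transform to the time domain
    return int(np.argmax(corr[:len(near)]))     # peak lag = estimated delay

rng = np.random.default_rng(2)
far = rng.standard_normal(2048)
near = np.concatenate([np.zeros(123), far])     # echo lagged by 123 samples
print(gcc_phat_delay(near, far))                # 123
```

The two FFTs, the whitening, and the inverse FFT per analysis window illustrate the computational load the text attributes to this approach.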
In a third implementation, the echo delay is determined by machine learning; that is, a model is trained specifically for a terminal device, yielding a neural network that determines that device's echo delay. Although this can determine the echo delay fairly accurately, terminal devices differ in hardware performance and other respects, so one model cannot accurately determine the echo delay for all devices; a dedicated model must be trained for each device type, meaning the approach has poor universality and compatibility.
In order to solve the problems in the related art, embodiments of the present application provide an echo delay determining method, which can accurately determine echo delays in various scenarios, and does not need to consume a large amount of computing resources of a terminal device, and has better universality and compatibility.
Specifically, in the echo delay determining method provided in the embodiment of the present application, watermark information inaudible to the human ear is embedded in a reference audio signal to be played to obtain a target audio signal; the target audio signal is then played while a near-end audio signal is collected; the collected near-end audio signal is subjected to watermark-information parsing; and when the previously embedded watermark information is parsed out of the near-end audio signal, the time lag of the echo in the near-end audio signal relative to the target audio signal, i.e. the echo delay, is determined from the position of the watermark information in the target audio signal and its position in the near-end audio signal.
This method applies audio watermarking to echo delay determination. Based on the auditory masking mechanism of the human ear, watermark information is embedded into the reference audio signal to obtain the target audio signal without affecting playback quality or being perceptible. Because the path from audio playback to echo capture is a closed loop, the near-end audio signal collected while the target audio signal is playing should also contain the watermark information, so the echo delay can be determined from the positions of the watermark information in the target and near-end audio signals. First, because the echo delay is determined from the watermark information in the near-end audio signal, no special requirement is placed on the signal-to-noise ratio of the near-end audio signal; even when multiple audio signals are mixed into the near-end audio signal, the method can still determine the echo delay accurately. Second, the device executing the method can determine the echo delay accurately without consuming large amounts of computing resources. Third, the method has good compatibility and universality across different hardware devices and software applications.
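The closed-loop idea can be simulated end to end. The additive embedding, the correlation-based detector, and all function names below are illustrative assumptions; the patent's actual embedding structure (watermark source coding and channel coding frames) is described in later figures:

```python
import numpy as np

def embed_watermark(reference, watermark, position):
    """Additively embed a known watermark sequence at a known sample
    position (an illustrative scheme; the text does not fix one)."""
    target = reference.copy()
    target[position:position + len(watermark)] += watermark
    return target

def locate_watermark(signal, watermark):
    """Locate the watermark by the peak of a sliding cross-correlation."""
    corr = np.correlate(signal, watermark, mode="valid")
    return int(np.argmax(corr))

rng = np.random.default_rng(0)
reference = rng.standard_normal(4000)
# A pseudo-noise watermark; unit amplitude here for a clear correlation
# peak -- a real embedder keeps it below the auditory masking threshold.
watermark = rng.choice([-1.0, 1.0], size=256)

pos_in_target = 1000
target = embed_watermark(reference, watermark, pos_in_target)

true_delay = 300                                # simulated echo lag, in samples
near_end = np.concatenate([np.zeros(true_delay), target])

pos_in_near = locate_watermark(near_end, watermark)
echo_delay = pos_in_near - pos_in_target        # position difference = echo delay
print(echo_delay)                               # 300
```

Note that the host signal (`reference`) acts as noise for the detector, yet the delay is still recovered, illustrating why the method places no special SNR requirement on the near-end signal.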
It should be understood that, in practical applications, the echo delay determination method provided in the embodiment of the present application may be applied to a terminal device, and the terminal device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like, but is not limited thereto.
In order to facilitate understanding of the echo delay determination method provided in the embodiment of the present application, an application scenario of the echo delay determination method provided in the embodiment of the present application is described in the following.
Referring to fig. 5, fig. 5 is a schematic view of an application scenario of the echo delay determination method according to the embodiment of the present application. As shown in fig. 5, the application scenario includes a terminal device 510 and a terminal device 520, target communication software runs in both the terminal device 510 and the terminal device 520, and the terminal device 510 and the terminal device 520 can communicate via a network. Both the terminal device 510 and the terminal device 520 may be configured to execute the echo delay determination method provided in the embodiment of the present application, and the echo delay determination method executed by the terminal device 510 is taken as an example and described below.
In practical applications, user a using terminal device 510 and user B using terminal device 520 may conduct a real-time voice call through the target communication software. In the real-time voice call process, the terminal device 510 receives an audio signal sent by the terminal device 520 through the network, and uses the audio signal as a reference audio signal to be played; then, watermark information that is not audible to human ears is embedded in the reference audio signal to obtain a target audio signal, where the watermark information may be preset watermark information. Further, the terminal device 510 may play the target audio signal.
In the real-time voice call process, the terminal device 510 may continuously collect an audio signal in its own environment, where the audio signal collected by the terminal device 510 is a near-end audio signal. Furthermore, the terminal device 510 may perform watermark information analysis processing on the collected near-end audio signal, and if the watermark information embedded in the target audio signal is analyzed from the near-end audio signal, the terminal device may determine the echo delay according to the position of the watermark information in the target audio signal and the position of the watermark information in the near-end audio signal.
Further, the terminal device 510 may filter the echo in the near-end audio signal based on the determined echo delay, and transmit the echo-filtered audio signal to the terminal device 520 through the network. Similarly, during the real-time voice call, the terminal device 520 also performs the above-mentioned operations.
It should be noted that the application scenario shown in fig. 5 is only an example, and in practical application, the echo delay determination method provided in the embodiment of the present application may be applied to a scenario of a two-person real-time voice call, and may also be applied to a scenario of a multi-person real-time voice call. In addition, the echo delay determination method provided by the embodiment of the application can be applied to a real-time voice call scene through communication software, and can also be applied to a real-time voice call scene based on hardware equipment, such as making a call. In addition, the echo delay determination method provided by the embodiment of the application can also be applied to a real-time video scene. The application scenario of the echo delay determination method provided in the embodiment of the present application is not limited at all.
It should be noted that echo is also a problem in scenarios where a user talks to an intelligent device (such as a smart speaker, an intelligent voice assistant in a terminal device, or an in-vehicle speech recognition device). In many cases, the audio signal collected by the intelligent device contains both the user's voice and the voice emitted by the device itself (i.e., echo); if the device performs subsequent processing directly on the collected audio signal, problems such as misrecognition and false responses easily arise, leading to a poor user experience. To avoid this, such intelligent devices often need to perform echo cancellation on the captured audio signal, and before doing so the echo delay may be determined by the echo delay determination method provided in the embodiment of the present application.
The echo delay determination method provided by the present application is described in detail below by way of an embodiment of the method.
Referring to fig. 6, fig. 6 is a schematic flowchart of an echo delay determination method according to an embodiment of the present application. In the following embodiments, the method is described with a terminal device as the execution subject. As shown in fig. 6, the echo delay determination method includes the following steps:
step 601: and embedding watermark information in the reference audio signal to be played to obtain a target audio signal.
In an application scenario where echo is to be eliminated, before the terminal device plays a reference audio signal, it uses an audio watermarking technique to embed watermark information that is inaudible to the human ear into the reference audio signal to be played, so as to obtain a corresponding target audio signal.
It should be noted that audio watermarking is a technology that embeds watermark information inaudible to human ears into an audio stream to be played by exploiting the auditory masking mechanism, such that the embedded watermark information can be identified and authenticated at the decoding end. Currently, the main uses of audio watermarking technology include: protecting the copyright of audio works, preventing live broadcast content from being illegally recorded, and tracing the source of leaked recordings of online conferences.
It should be noted that, in different application scenarios, the reference audio signal may be an audio signal generated in different manners. For example, in a real-time communication application scenario, the reference audio signal should be an audio signal sent by a terminal device of a communication counterpart; for example, in an application scenario in which a user a and a user B perform a real-time voice call, for a terminal device used by the user a, an audio signal sent by the terminal device used by the user B is a reference audio signal, and for a terminal device used by the user B, an audio signal sent by the terminal device used by the user a is a reference audio signal; it should be understood that real-time communication application scenarios include, but are not limited to: a real-time voice call scenario for two or more people, a real-time video call scenario for two or more people. For example, in an application scenario where a user has a conversation with a terminal device, the reference audio signal should be an audio signal generated by the terminal device in response to the conversation content of the user; for example, in an application scenario in which a user has a conversation with a smart speaker, after receiving a voice signal sent by the user, the smart speaker may generate an audio signal for responding to the voice signal, where the audio signal is a reference audio signal. Of course, in other application scenarios, the reference audio signal may also be an audio signal generated in other manners, and the application scenarios of the embodiment of the present application are not limited at all, and the generation manner of the reference audio signal is not limited at all.
It should be noted that the watermark information embedded in the reference audio signal may be preset, for example, the watermark information may be preset text information or binary code, and the application does not limit the watermark information embedded in the reference audio signal in any way.
In a possible implementation manner, when the terminal device embeds watermark information in the reference audio signal, it can be implemented by the flow shown in fig. 7. Fig. 7 is a schematic flowchart of generating a target audio signal according to an embodiment of the present application. As shown in fig. 7, the generation flow of the target audio signal includes the following steps:
step 701: and carrying out source coding on the watermark information to obtain a watermark source coding frame.
Before embedding the watermark information into the reference audio signal, the terminal device needs to perform source coding on preset watermark information (such as preset text information or binary codes) to obtain a corresponding watermark source coding frame, so that the watermark information can be embedded into the reference audio signal.
During specific implementation, the terminal device can divide the watermark information by taking the preset byte length as a unit to obtain a plurality of pieces of sub-watermark information; then, for each piece of sub-watermark information, performing source coding on the sub-watermark information to obtain a watermark source coding frame corresponding to the sub-watermark information, where the watermark source coding frame includes the byte length of the watermark information, the arrangement serial number of the sub-watermark information in the watermark information, the sub-watermark information itself, and a check code.
For example, the terminal device may perform division processing on the watermark information in units of a single byte to obtain a plurality of pieces of sub-watermark information. Then, according to a preset watermark information source coding frame structure, carrying out information source coding on each sub-watermark information to obtain a corresponding watermark information source coding frame; taking the frame structure of the watermark information source coding as shown in fig. 8 as an example, when a terminal device constructs a watermark information source coding frame corresponding to a certain sub-watermark information, the byte length (i.e., the watermark length) of the watermark information and the arrangement sequence number (i.e., the byte sequence number) of the sub-watermark information in the watermark information may be added to the frame header of the watermark information source coding frame, the content (i.e., the byte content) of the sub-watermark information is added to the frame body of the watermark information source coding frame, and the check code is added to the frame tail of the watermark information source coding frame; the default length of a watermark source coding frame is assumed to be 32 bits, wherein the byte length of watermark information can occupy 4 bits, the arrangement serial number of sub-watermark information in the watermark information can occupy 4 bits, the content of the sub-watermark information can occupy 8 bits, and the check code can occupy 16 bits; of course, in practical applications, the watermark source coding frame structure may also be a structure in other forms, and the watermark source coding frame structure is not limited in this application.
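The 32-bit frame layout described above can be sketched as follows. The CRC-16/CCITT parameters and the single-byte split are illustrative assumptions for the example, not requirements of the application:

```python
def crc16_ccitt(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Bitwise CRC-16/CCITT over the input bytes (illustrative parameters)."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def encode_watermark(watermark: bytes) -> list[int]:
    """Split the watermark into single bytes and pack each into a 32-bit
    source-coded frame: 4-bit watermark length | 4-bit byte sequence
    number | 8-bit byte content | 16-bit check code (layout as in fig. 8)."""
    assert len(watermark) < 16  # total length must fit in the 4-bit field
    frames = []
    for i, b in enumerate(watermark):
        header_body = (len(watermark) << 12) | (i << 8) | b  # 16-bit header + body
        check = crc16_ccitt(header_body.to_bytes(2, "big"))  # 16-bit frame tail
        frames.append((header_body << 16) | check)           # 32 bits in total
    return frames
```

Each frame carries the full watermark length and its byte's position, so the decoder can reassemble the watermark even if frames arrive independently.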
It should be noted that the check code in the above-mentioned watermark source coding frame may be a Cyclic Redundancy Check (CRC) code, which is a check code with error detection capability. The check code in the watermark source coding frame may also be a check code generated by a packet check method. The check code in the watermark source coding frame is not limited in any way in the present application.
Step 702: a target location for embedding watermark information in the reference audio signal is detected.
Furthermore, before embedding the watermark information into the reference audio signal, the terminal device needs to detect a target position in the reference audio signal that can be used for embedding the watermark information, i.e. a target position that can be used for embedding the watermark source-coded frame.
In specific implementation, the terminal device may detect an energy spectrum envelope of a reference audio signal, determine a position where the energy spectrum envelope in the reference audio signal exceeds a preset energy threshold as a target position where watermark information may be embedded, and mark a watermark loading enabling bit for the target position.
For example, after acquiring the reference audio signal, the terminal device may detect an energy spectrum envelope of the reference audio signal, where the energy spectrum envelope of the reference audio signal can represent energy levels at various positions in the reference audio signal. Furthermore, based on the energy spectrum envelope of the reference audio signal, detecting a position in the reference audio signal where the energy spectrum envelope exceeds a preset energy threshold, that is, detecting a position in the reference audio signal where energy is higher, determining the position in the reference audio signal where energy is higher as a target position where watermark information can be embedded, and marking a watermark loading enabling flag bit for the target position. In this way, watermark information is prevented from being embedded in the audio signal with silence or low energy in the reference audio signal, so as to avoid the situation that effective information is lost at the decoding end of the audio signal.
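The energy-envelope screening described above might be sketched with short-time frame energy as the envelope sample; the frame length and energy threshold below are illustrative tuning parameters:

```python
import numpy as np

def find_embed_positions(signal: np.ndarray, frame_len: int,
                         energy_threshold: float) -> list[int]:
    """Mark frames whose short-time energy exceeds the threshold as target
    positions for watermark embedding (the 'watermark loading enable' flag),
    skipping silent or low-energy regions of the reference signal."""
    positions = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        if np.mean(frame ** 2) > energy_threshold:  # energy-envelope sample
            positions.append(start // frame_len)    # frame index, flag enabled
    return positions
```

Only the flagged frame indices later receive a watermark source coding frame, which realizes the "avoid embedding in silence" behavior described above.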
It should be understood that, in an actual application, the terminal device may execute step 701 first and then execute step 702, may also execute step 702 first and then execute step 701, and may also execute step 701 and step 702 at the same time, where the present application does not make any limitation on the execution order of step 701 and step 702.
Step 703: and performing channel coding on the reference audio signal and the watermark source coding frame based on the target position in the reference audio signal to obtain the target audio signal.
The terminal equipment completes source coding of the watermark information to obtain a watermark source coding frame, and after a target position which can be used for embedding the watermark information in the reference audio signal is detected, channel coding can be performed on the reference audio signal and the watermark source coding frame based on the target position in the reference audio signal to realize embedding of the watermark information into the reference audio signal, so that a target audio signal is obtained.
In a specific implementation, in the process of performing channel coding on the reference audio signal, the terminal device may determine whether the current coding position of the reference audio signal is a target position where the watermark information can be embedded. If the current coding position is the target position, the terminal equipment can embed a watermark information source coding frame in the audio signal at the current coding position in the reference audio signal through a watermark modulation algorithm to obtain a first signal to be coded; and then carrying out channel coding on the first signal to be coded to obtain a channel coding frame corresponding to the current coding position. If the current coding position is not the target position, the terminal device may directly use the audio signal at the current coding position in the reference audio signal as a second signal to be coded, and perform channel coding on the second signal to be coded to obtain a channel coding frame corresponding to the current coding position. And then, combining the channel coding frames corresponding to the coding positions in the reference audio signal to obtain the target audio signal.
For example, when the terminal device performs channel coding on the reference audio signal, it may detect whether the current coding position of the reference audio signal is marked with a watermark loading enabling flag bit. If the current coding position is marked with a watermark loading enabling flag bit, the current coding position may be determined as a target position that can be used for embedding watermark information; then, the watermark source coding frame obtained by source coding is added to the audio signal at the current coding position by using a watermark modulation algorithm, thereby obtaining a first signal to be coded; and then, according to a preset channel coding frame structure, channel coding is performed on the first signal to be coded to obtain a channel coding frame corresponding to the current coding position. Otherwise, if the current coding position is not marked with the watermark loading enabling flag bit, it is determined that the current coding position is not a target position that can be used for embedding watermark information; then, the audio signal at the current coding position is directly taken as a second signal to be coded, and channel coding is performed on the second signal to be coded according to the preset channel coding frame structure to obtain a channel coding frame corresponding to the current coding position. In this way, the channel coding processing is performed for each coding position in the reference audio signal to obtain the channel coding frame corresponding to each coding position, and the channel coding frames corresponding to the coding positions are combined according to their arrangement order in the reference audio signal, so that the target audio signal can be obtained.
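The per-position branching above can be sketched abstractly; the `modulate` and `encode` callables below are hypothetical placeholders standing in for the watermark modulation algorithm and the channel encoder:

```python
def channel_encode_signal(frames, flagged_positions, wm_frames, modulate, encode):
    """Walk the reference signal frame by frame: at positions carrying the
    watermark loading enable flag, embed the next watermark source-coded
    frame via the modulation routine before channel coding; elsewhere,
    channel-encode the frame directly. Returns the combined coded frames."""
    out, wm_iter = [], iter(wm_frames)
    for idx, frame in enumerate(frames):
        if idx in flagged_positions:
            wm = next(wm_iter, None)               # may run out of watermark frames
            to_encode = modulate(frame, wm) if wm is not None else frame
        else:
            to_encode = frame                      # second signal to be coded
        out.append(encode(to_encode))              # channel coding per position
    return out
```

With trivial stand-ins (e.g. `modulate=lambda f, w: f + [w]`, `encode=lambda f: f`) the control flow can be checked independently of any real modulator.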
It should be noted that, regarding the selection of the watermark modulation algorithm, an appropriate watermark modulation algorithm may be selected according to actual scene requirements. As an example, the embodiment of the application may adopt a time-domain bidirectional multi-kernel echo-hiding watermark modulation algorithm with strong robustness, small sound quality damage, and low complexity; the main principle of this watermark modulation algorithm is to modulate the watermark information into early reflected sound that cannot be distinguished by human ears, by utilizing the temporal masking mechanism of human hearing; in addition, the watermark modulation algorithm adopts a bidirectional echo to resist interference caused by spatial multipath reflection, and adopts multiple kernels to increase the data transmission code rate. Of course, in practical applications, the terminal device may also use other watermark modulation algorithms to add the watermark source coding frame to the reference audio signal, and the application does not limit the watermark modulation algorithm used.
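A minimal sketch of echo-hiding modulation for a single bit, assuming two candidate delays to represent '0' and '1'; the delays, echo amplitude, and the simple symmetric bidirectional kernel are illustrative choices, not the exact multi-kernel algorithm of the embodiment:

```python
import numpy as np

def echo_hide_bit(frame: np.ndarray, bit: int,
                  d0: int = 50, d1: int = 70, alpha: float = 0.05) -> np.ndarray:
    """Embed one bit by bidirectional echo hiding: a '0' bit uses delay d0,
    a '1' bit uses delay d1. Both a trailing (causal) echo and a leading
    (anti-causal) echo are mixed in at low amplitude, so the ear perceives
    them as masked early reflections rather than distinct sounds."""
    d = d1 if bit else d0
    out = frame.astype(float).copy()
    out[d:] += alpha * frame[:-d]    # forward (trailing) echo kernel
    out[:-d] += alpha * frame[d:]    # backward (leading) echo kernel
    return out
```

The backward echo is what gives the scheme its resistance to multipath reflection, since room reflections only ever add causal (trailing) copies.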
It should be noted that, because echo delay usually does not change greatly in a short time, in practical applications, a terminal device may not embed a watermark source coding frame in an audio signal at each target position in a reference audio signal; for example, the terminal device may embed a watermark source coding frame in the audio signal at a certain target position in the reference audio signal for the current reference audio signal at a certain time interval, for example, 1min, 30s, and so on. Of course, in order to ensure the accuracy of the determined echo delay, the terminal device may embed the watermark source coding frame in the audio signal at each target position in the reference audio signal, and the specific manner of embedding the watermark source coding frame is not limited in this application.
In addition, in order to improve the identification rate at the watermark information analysis end and the robustness to different transmission channels, an embodiment of the present application further provides a channel coding frame structure, and fig. 9 is a schematic diagram of the channel coding frame structure. As shown in fig. 9, the header of the channel-encoded frame is used to carry the synchronization code; the frame body of the channel-encoded frame is used to carry a data packet: when the channel-encoded frame corresponds to a target position in the reference audio signal that can be used for embedding watermark information, the data packet comprises the audio signal embedded with the watermark source coding frame, namely the first signal to be coded, and when the channel-encoded frame corresponds to a position in the reference audio signal that cannot be used for embedding watermark information, the data packet comprises the audio signal without the embedded watermark source coding frame, namely the second signal to be coded; the frame tail of the channel-encoded frame is used to carry an error correction code, which is generated according to the content carried by the frame header and the frame body of the channel-encoded frame.
For example, when the terminal device performs channel coding, a synchronization code may be added to a frame header of a channel coding frame, where the synchronization code may be a string of fixed code words used for frame synchronization, and a specific length and content of the synchronization code may be adjusted according to an actual channel condition. A data packet may be added to a frame body of the channel coding frame, and if the channel coding frame corresponds to the target position, the data packet may include the audio signal at the coding position in the reference audio signal and the watermark source coding frame; if the channel-encoded frame does not correspond to the target location, the audio signal at the encoded location in the reference audio signal may be included in the data packet. Error correcting codes can be added at the tail of a channel coding frame, and can be set to reduce the error rate of a decoding end under the condition that the signal-to-noise ratio of a channel is poor, so that the transmission quality of signals is ensured; illustratively, the error correction code in the channel coding frame may be a BCH error correction code, and when the terminal device generates the BCH error correction code in the channel coding frame, the information carried by the frame header and the frame body in the channel coding frame may be divided into a plurality of message groups according to a preset number of bits, and then each message group is converted into a binary digit group with a specific length, that is, a codeword, and the codewords corresponding to each message group may form the BCH error correction code; of course, in practical applications, the terminal device may also set other types of error correction codes in the channel coding frame, and the application does not limit the types of error correction codes in the channel coding frame.
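The sync-header/payload/tail layout can be sketched as follows. A real implementation would use a BCH code for the tail as described above; a single even-parity bit stands in here to keep the example short, and the 8-bit sync word is an arbitrary illustration:

```python
SYNC = 0b10110111  # fixed sync word for frame synchronization (illustrative)

def pack_channel_frame(payload_bits: list[int]) -> list[int]:
    """Assemble a channel-coded frame as: sync header | data packet |
    error-check tail. The tail here is one even-parity bit over the header
    and payload, a stand-in for the BCH error correction code of fig. 9."""
    sync_bits = [(SYNC >> i) & 1 for i in range(7, -1, -1)]  # MSB first
    parity = sum(sync_bits + payload_bits) % 2               # tail bit
    return sync_bits + payload_bits + [parity]
```

The decoder can then locate frame boundaries by scanning for the sync word before attempting any error correction or watermark demodulation.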
It should be understood that the implementation manner of generating the target audio signal shown in fig. 7 is merely an example, in practical applications, the terminal device may also embed the watermark information into the reference audio signal in other manners to obtain the target audio signal, and the present application does not limit any manner of generating the target audio signal.
Step 602: playing the target audio signal; and collects near-end audio signals.
And the terminal equipment embeds the watermark information into the reference audio signal to be played to obtain a target audio signal, and then plays the target audio signal. In an application scenario where echo cancellation is required, audio signal playing and audio signal acquisition are generally performed simultaneously, so that terminal equipment can acquire an audio signal while playing a target audio signal, and the audio signal acquired by the terminal equipment is a near-end audio signal.
Exemplarily, in a real-time communication application scenario, the near-end audio signal is an audio signal acquired by the terminal device itself; for example, in an application scenario in which the user a and the user B perform a real-time voice call, for a terminal device used by the user a, an audio signal in an environment where the user a is located acquired by the user a is a near-end audio signal, and for a terminal used by the user B, an audio signal in an environment where the user B is located acquired by the user B is a near-end audio signal. Illustratively, in an application scenario in which a user has a conversation with a terminal device, a near-end audio signal is also an audio signal acquired by the terminal device itself; for example, in an application scenario in which a user has a conversation with a smart sound box, an audio signal collected by the smart sound box in an environment where the smart sound box is located is a near-end audio signal.
It should be understood that the near-end audio signal may include any audio signal collected by the terminal device in the environment where the terminal device is located, that is, the near-end audio signal may include not only a voice signal uttered by a user, but also a target audio signal played by the terminal device itself, and may also include a noise audio signal in the environment, and the audio signal included in the near-end audio signal is not limited in any way herein.
Step 603: and analyzing the watermark information of the near-end audio signal.
After the terminal device collects the near-end audio signal, the terminal device needs to analyze and process the watermark information of the near-end audio signal so as to judge whether the collected near-end audio signal includes the watermark information embedded in the reference audio signal.
In a possible implementation manner, when the terminal device performs watermark information analysis processing on the near-end audio signal, the processing may be implemented by the flow shown in fig. 10. Fig. 10 is a schematic flowchart of watermark information parsing processing according to an embodiment of the present application. As shown in fig. 10, the watermark information parsing process flow includes the following steps:
step 1001: and demodulating the near-end audio signal to obtain a binary bit stream.
After the terminal device collects the near-end audio signal, in order to ensure that the near-end audio signal can be correctly demodulated and further analyze whether the near-end audio signal includes the watermark information, the terminal device needs to demodulate the audio signal with the corresponding frame length in the near-end audio signal according to the frame length of the channel coding frame in the target audio signal, so as to obtain the corresponding binary bit stream.
Step 1002: and carrying out channel decoding on the binary bit stream to obtain a channel decoded stream.
The terminal device demodulates the near-end audio signal to obtain a binary bit stream, and then may further perform channel decoding on the binary bit stream to obtain a channel decoded stream.
In specific implementation, if a channel coding frame in a target audio signal generated by the terminal device before includes a synchronization code and an error correction code, the terminal device may perform frame synchronization based on the synchronization code in the binary bit stream; furthermore, in the process of channel decoding the binary bit stream, correcting the error code in the channel decoding stream based on the error correcting code in the binary bit stream; if the corrected bit error number does not exceed the preset bit error number, continuing to perform watermark demodulation processing on the channel decoding stream by adopting a watermark demodulation algorithm; and if the corrected bit error number exceeds the preset bit error number, discarding the channel decoding stream.
For example, when the terminal device performs channel decoding on the binary bit stream obtained by demodulating the near-end audio signal, frame synchronization may be performed according to a synchronization code in the binary bit stream, so as to separate channel decoded streams corresponding to each channel encoded frame in the binary bit stream. Further, the error correction code in the binary bit stream is used to carry out error correction processing on the separated channel decoding stream; it should be noted that, in the process from the playing of the target audio signal to the re-acquisition, the target audio signal may be interfered to a certain extent by the influence of factors such as the propagation environment of the audio signal, and further an error code occurs in the target audio signal; in order to solve the problem, the terminal device may correct the error code in the channel decoded stream by using the error correction code in the binary bit stream, that is, the terminal device may restore the previously generated channel encoded frame according to the error correction code by using an algorithm opposite to that used when the error correction code is generated, and then correct the error code in the channel decoded stream according to the restored channel encoded frame. 
If the bit error number corrected by the terminal equipment based on the error correcting code does not exceed the preset bit error number, the target audio signal is still in the error correcting capability range of the error correcting code, and the terminal equipment can continue to perform watermark demodulation processing based on the channel decoding stream of the corrected bit error; if the bit error number corrected by the terminal device based on the error correcting code exceeds the preset bit error number, the target audio signal is beyond the correction capability range of the error correcting code, the information carried in the channel decoded stream is possibly distorted, and the channel decoded stream can be discarded.
Step 1003: and performing watermark demodulation processing on the channel decoding stream through a watermark demodulation algorithm.
The terminal device completes channel decoding processing on the binary bit stream, and after obtaining the channel decoded stream, may further perform watermark demodulation processing on the channel decoded stream by using a watermark demodulation algorithm to determine whether the channel decoded stream carries a hidden encoded bit stream, that is, determine whether a watermark source encoded frame is embedded in the channel decoded stream.
It should be noted that the watermark demodulation algorithm used here by the terminal device should correspond to the watermark modulation algorithm used when embedding the watermark information. For example, if the watermark modulation algorithm adopted when the terminal device embeds the watermark information into the reference audio signal is an echo-hiding modulation algorithm, the terminal device needs to perform watermark demodulation processing on the channel decoded stream by using a corresponding cepstrum method. Of course, if another watermark modulation algorithm is adopted when embedding the watermark information into the reference audio signal, the terminal device may correspondingly adopt another watermark demodulation algorithm to perform watermark demodulation processing on the channel decoded stream; the watermark demodulation algorithm adopted is not limited in any way in the present application.
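A sketch of the cepstrum method for the echo-hiding case: the real cepstrum of an echo-laden frame peaks at the echo delay, so the hidden bit can be read off by comparing the cepstrum at the two candidate delays. The delays must match the modulator's; the values below are illustrative:

```python
import numpy as np

def detect_echo_bit(frame: np.ndarray, d0: int = 50, d1: int = 70) -> int:
    """Decide the hidden bit by cepstral analysis: compute the real
    cepstrum (inverse FFT of the log magnitude spectrum) and compare its
    value at the two candidate echo delays; the larger peak wins."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # epsilon avoids log(0)
    cepstrum = np.real(np.fft.ifft(log_mag))
    return 1 if cepstrum[d1] > cepstrum[d0] else 0
```

An echo of amplitude a at delay d contributes a peak of roughly a/2 at quefrency d, which is why a low-amplitude echo remains detectable even though it is inaudible.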
Step 1004: and carrying out source decoding on the watermark source coding frame to obtain the watermark information under the condition that the watermark source coding frame is demodulated from the channel decoding stream through the watermark demodulation processing.
If the terminal device demodulates the watermark information source coding frame from the channel decoding stream in step 1003, the terminal device may continue to perform information source decoding on the watermark information source coding frame, so as to obtain watermark information carried in the watermark information source coding frame.
In a specific implementation, the terminal device may first check the watermark source coding frame according to the check code in the watermark source coding frame; if the check passes, the watermark information is obtained from the watermark source coding frame; otherwise, if the check fails, the watermark source coding frame may be discarded.
For example, assuming that the check code in the watermark source coding frame embedded in the reference audio signal by the terminal device is a CRC check code, when the terminal device checks the watermark source coding frame in the channel decoded stream, the codeword of the watermark source coding frame is divided by the generator polynomial used when the check code was generated; if the remainder is 0, it indicates that the codeword in the watermark source coding frame has no error and the frame passes the check, and at this time, the watermark information can be obtained from the watermark source coding frame; otherwise, if the remainder is not 0, it indicates that the codeword of the watermark source coding frame has an error and the frame does not pass the check, and at this time, the watermark source coding frame may be discarded. It should be understood that if the check code previously embedded in the reference audio signal by the terminal device is another kind of check code, the terminal device may also check the watermark source coding frame in the channel decoded stream by another corresponding method, and the method for checking the watermark source coding frame is not limited herein.
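The CRC check described above can be sketched as follows (CRC-16/CCITT parameters and the 32-bit frame layout of fig. 8 are illustrative assumptions). Instead of performing polynomial long division over the full codeword, the sketch equivalently recomputes the CRC over the header and body and compares it with the frame tail:

```python
def crc16(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Bitwise CRC-16 (CCITT polynomial; illustrative parameters)."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def verify_source_frame(frame):
    """Check a received 32-bit source-coded frame: recompute the CRC over
    the 16-bit header+body and compare with the 16-bit tail. On success,
    return the 8-bit sub-watermark byte; otherwise discard (None)."""
    header_body, check = frame >> 16, frame & 0xFFFF
    if crc16(header_body.to_bytes(2, "big")) != check:
        return None                 # check failed: discard the frame
    return header_body & 0xFF       # byte content field of the frame body
```

Frames that fail the check are simply dropped; the byte sequence number in surviving frames lets the decoder reassemble the watermark out of order.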
It should be understood that, if the terminal device performs source coding on the watermark information to generate a watermark source coding frame before, the watermark information is split, and the watermark source coding frame is generated based on the split sub-watermark information, then the terminal device performs source decoding on the watermark source coding frame in the channel decoding stream at this time to obtain the watermark information, and actually obtains a certain sub-watermark information obtained by splitting the watermark information.
It should be understood that the implementation manner of the watermark information parsing process shown in fig. 10 is only an example, in practical applications, the terminal device may also perform the watermark information parsing process on the near-end audio signal in other manners, and the implementation manner of the watermark information parsing process is not limited in this application.
Step 604: and under the condition that the watermark information is analyzed from the near-end audio signal through the watermark information analysis processing, determining echo delay according to the position of the watermark information in the target audio signal and the position of the watermark information in the near-end audio signal.
If the terminal device analyzes the watermark information embedded in the reference audio signal through step 601 from the near-end audio signal acquired by the terminal device through step 603, the terminal device may determine a time lag of an echo (corresponding to the target audio signal) in the near-end audio signal relative to the target audio signal, that is, determine an echo delay, according to the embedding position of the watermark information in the target audio signal and the position of the watermark information analyzed from the near-end audio signal.
In a possible implementation manner, the terminal device may determine a time point at which an audio frame with watermark information embedded in a target audio signal is played through an audio playing channel, as a first time point; determining a time point of an audio frame which includes the watermark information and is acquired by the near-end audio signal through the audio acquisition channel as a second time point; and calculating the time difference between the second time point and the first time point, wherein the time difference is the echo delay.
Specifically, when the terminal device plays the target audio signal, a time point of playing the audio frame embedded with the watermark information through the audio playing channel may be recorded as a first time point; for example, assuming that the watermark information a is embedded in a fifth audio frame (which may be understood as the channel-encoded frame in the foregoing) in the target audio signal, the terminal device may record a time point of playing the fifth audio frame through the audio playing channel as the first time point, for example, assuming that the time point of playing the fifth audio frame through the audio playing channel by the terminal device is 9:44:35, 9:44:35 is taken as the first time point. When the terminal device collects the near-end audio signal, time points of each audio frame in the near-end audio signal collected through the audio collection channel may be recorded, and if the terminal device determines that a tenth audio frame in the near-end audio signal includes watermark information a through watermark information analysis processing, the terminal device may determine a time point at which the tenth audio frame is collected through the audio collection channel as a second time point, for example, if the time point at which the tenth audio frame is collected through the audio collection channel by the terminal device is 9:44:36, 9:44:36 is taken as the second time point. Further, the terminal device may calculate a time difference between the second time point and the first time point as the echo delay, for example, in the case where the first time point is 9:44:35 and the second time point is 9:44:36, the calculated echo delay is 1 s.
In another possible implementation manner, the terminal device may determine a first time length according to the number of the audio frame embedded with the watermark information in the target audio signal and a first frame duration, where the first frame duration is the playing duration of each audio frame, and the first time length represents the time interval between the time at which the audio frame embedded with the watermark information is played through the audio playing channel and the audio start playing time. The terminal device may determine a second time length according to the number of the audio frame including the watermark information in the near-end audio signal and a second frame duration, where the second frame duration is the acquisition duration of each audio frame, the second time length represents the time interval between the time at which the audio frame including the watermark information is acquired through the audio acquisition channel and the audio start acquisition time, and the audio start acquisition time is the same as the audio start playing time. Further, the difference between the second time length and the first time length is calculated to obtain the echo delay.
Specifically, in a real-time voice call scenario, an audio acquisition device and an audio playback device of a terminal device usually work simultaneously; that is, after the user connects the voice call through the terminal device, the speaker or the handset of the terminal device starts to play audio signals (the played audio signals may include blank audio signals), and the microphone of the terminal device also starts to collect audio signals in the current environment (the collected audio signals may also include blank audio signals). In other words, for the terminal device, the audio start playing time (i.e., the time when the audio playing device starts playing the audio signal) and the audio start capturing time (i.e., the time when the audio capturing device starts capturing the audio signal) are the same.
Based on this, the terminal device may determine the echo delay according to the time length from the audio start playing time to the playing time of the audio frame embedded with the watermark information, and the time length from the audio start acquisition time to the acquisition time of the audio frame including the watermark information. For example, starting from the audio start playing time, the terminal device may assign numbers to the audio frames in the target audio signal one by one from 1 in playing order; it may then calculate the time length from the audio start playing time to the playing time of the audio frame embedded with watermark information b, that is, the first time length, according to the number of that audio frame in the target audio signal and the first frame duration (that is, the playing duration of each audio frame). For example, assuming that the terminal device embeds watermark information b in the fifth audio frame of the target audio signal, and the playing duration of each audio frame is 100 ms, the calculated first time length should be 5 × 100 ms = 500 ms.
Correspondingly, starting from the audio start acquisition time, the terminal device may assign numbers to the audio frames in the near-end audio signal one by one from 1 in acquisition order; it may then calculate the time length from the audio start acquisition time to the acquisition time of the audio frame including watermark information b, that is, the second time length, according to the number of that audio frame in the near-end audio signal and the second frame duration (that is, the acquisition duration of each audio frame). For example, assuming that the terminal device parses out that the tenth audio frame in the near-end audio signal includes watermark information b, and the acquisition duration of each audio frame is 100 ms, the calculated second time length should be 10 × 100 ms = 1000 ms. Further, the terminal device may calculate the difference between the second time length and the first time length to obtain the echo delay; for example, with the first time length being 500 ms and the second time length being 1000 ms, the calculated echo delay should be 500 ms.
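The frame-number-based calculation above reduces to simple arithmetic; this sketch follows the numbering-from-1 convention and the example figures in the text (function and parameter names are illustrative):

```python
def echo_delay_from_frame_numbers(play_frame_no: int, play_frame_ms: int,
                                  cap_frame_no: int, cap_frame_ms: int) -> int:
    """First/second time length as frame number x frame duration (ms);
    the echo delay is their difference."""
    first_ms = play_frame_no * play_frame_ms    # first time length
    second_ms = cap_frame_no * cap_frame_ms     # second time length
    return second_ms - first_ms

# Example from the text: watermark in frame 5 played, frame 10 captured,
# both channels using 100 ms frames
print(echo_delay_from_frame_numbers(5, 100, 10, 100))  # 500
```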
It should be understood that the above implementation manner of determining the echo delay according to the position of the watermark information in the target audio signal and the position of the watermark information in the near-end audio signal is merely an example, and in practical applications, other manners may also be adopted to determine the echo delay according to the position of the watermark information in the target audio signal and the position of the watermark information in the near-end audio signal, and the present application does not limit the determination manner of the echo delay in any way.
After determining the echo delay, the terminal device may align the near-end audio signal and the target audio signal based on the echo delay, and further perform adaptive filtering and nonlinear processing on the near-end audio signal based on the target audio signal to eliminate an echo corresponding to the target audio signal in the near-end audio signal.
In a specific implementation, the terminal device may translate the target audio signal backward along the time axis based on the echo delay, so that the start time point of the target audio signal coincides with the start time point of the near-end audio signal. Then, the terminal device may use the target audio signal as a reference signal for filtering the echo, and perform adaptive filtering processing on the near-end audio signal based on the target audio signal to filter out the echo corresponding to the target audio signal in the near-end audio signal. Furthermore, the terminal device may perform nonlinear processing, based on the target audio signal, on the near-end audio signal obtained after the adaptive filtering processing, to filter out the nonlinear echo corresponding to the target audio signal; in this way, a near-end audio signal that does not include the echo is obtained. In an application scenario of real-time communication, the terminal device may send the processed near-end audio signal to the terminal device of the communication counterpart; in an application scenario in which the user converses with the terminal device, the terminal device may perform subsequent analysis processing based on the processed near-end audio signal, so as to make a corresponding response to the voice control signal sent by the user.
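The alignment-then-filter step can be sketched as follows. The patent does not specify an adaptive filtering algorithm; NLMS is used here as one common choice, and all names, tap counts, and step sizes are illustrative assumptions:

```python
import numpy as np

def align_and_cancel(near, target, delay_samples, taps=64, mu=0.5, eps=1e-8):
    """Shift the reference back by the echo delay so it lines up with the
    near-end signal, then run an NLMS adaptive filter to subtract the
    linear echo; the residual is the (mostly) echo-free near-end signal."""
    ref = np.concatenate([np.zeros(delay_samples), target])[:len(near)]
    w = np.zeros(taps)
    out = np.zeros(len(near))
    for n in range(taps - 1, len(near)):
        x = ref[n - taps + 1:n + 1][::-1]   # recent reference samples
        y = w @ x                           # echo estimate
        e = near[n] - y                     # residual after cancellation
        w += mu * e * x / (x @ x + eps)     # NLMS weight update
        out[n] = e
    return out
```

In a real canceller this would be followed by the nonlinear processing stage described above; this sketch covers only the linear part.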
The above echo delay determination method applies audio watermarking technology to determine the echo delay. Based on the auditory masking mechanism of the human ear, watermark information is embedded in the reference audio signal to obtain the target audio signal without affecting the audio playing quality or being perceived by the human ear. Because the path from audio playing to echo acquisition is a closed loop, the near-end audio signal collected while the target audio signal is played should also include the watermark information, and the echo delay can then be determined according to the positions of the watermark information in the target audio signal and in the near-end audio signal. On one hand, since the echo delay is determined based on the watermark information in the near-end audio signal and no special requirement is imposed on the signal-to-noise ratio of the near-end audio signal, the echo delay can be accurately determined by the method provided in the embodiment of the present application even when various audio signals are mixed in the near-end audio signal. On another hand, the device executing the method provided in the embodiment of the present application can accurately determine the echo delay without consuming a large amount of computing resources. On yet another hand, the method provided in the embodiment of the present application has good compatibility and universality across different hardware devices and software applications; that is, for different hardware devices and software applications, the echo delay can be accurately determined by the method provided in the embodiment of the present application.
In order to further understand the echo delay determination method provided in the embodiment of the present application, the following describes the whole method by way of example, taking its application in a real-time communication scenario. The echo delay determination method mainly comprises two stages: watermark information injection at the playing end and watermark information analysis at the recording end.
The principle of watermark information injection at the play end is shown in fig. 11. The method specifically comprises the following three parts:
1) Source coding: watermark original information (corresponding to the watermark information described above) is obtained; the watermark original information may be preset text information or a binary code. When source coding the watermark original information, it may first be divided into a plurality of pieces of sub-watermark information in units of bytes; then each piece of sub-watermark information is source coded to obtain a corresponding watermark source coding frame. The watermark source coding frame includes the byte length of the watermark original information, the sequence number of the sub-watermark information within the watermark original information, the content of the sub-watermark information, and a check code; the check code may be a CRC check code or a check code generated by another checking scheme such as a block parity check. The frame structure of the watermark source coding frame may be specifically as shown in fig. 8.
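As a concrete illustration of this frame layout, here is a minimal sketch; the field widths (one byte each for total length and sequence number) and the truncated CRC-32 check byte are assumptions made for illustration, the actual layout being given in fig. 8:

```python
import struct
import zlib

def source_encode(watermark: bytes, chunk: int = 1):
    """Split the watermark into byte-sized sub-watermarks and pack each
    into a source-coded frame: total byte length, sequence number,
    sub-watermark content, then a check byte over the preceding fields."""
    frames = []
    for seq, i in enumerate(range(0, len(watermark), chunk)):
        sub = watermark[i:i + chunk]
        body = struct.pack("BB", len(watermark), seq) + sub
        crc = zlib.crc32(body) & 0xFF   # 1-byte truncated CRC as check code
        frames.append(body + bytes([crc]))
    return frames
```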
2) Audio signal preprocessing: the received far-end audio signal is preprocessed. The preprocessing mainly detects the energy spectrum envelope of the far-end audio signal, determines the positions in the far-end audio signal where the energy spectrum envelope exceeds a preset energy threshold as target positions that can be used for embedding watermark source coding frames, and marks a watermark loading enable flag bit for each target position. This prevents watermark source coding frames from being embedded in silent or low-energy audio signals, and prevents the decoding end from losing valid information.
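A simple per-frame version of this thresholding might look as follows; the frame length, the threshold value, and the use of mean-square energy (rather than a smoothed spectral envelope) are simplifying assumptions:

```python
import numpy as np

def mark_embed_positions(signal, frame_len=256, energy_threshold=1e-3):
    """Compute a per-frame energy envelope; frames whose energy exceeds
    the threshold get the watermark loading enable flag set, so that
    silent or low-energy frames are never used for embedding."""
    n_frames = len(signal) // frame_len
    flags = []
    for k in range(n_frames):
        frame = signal[k * frame_len:(k + 1) * frame_len]
        energy = float(np.mean(frame ** 2))
        flags.append(energy > energy_threshold)
    return flags
```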
3) Channel coding: the watermark source coding frames generated by source coding and the far-end audio signal marked with watermark loading enable flag bits obtained by audio signal preprocessing are acquired, and the far-end audio signal is then channel coded based on the acquired data. In a specific implementation, when the current coding position in the far-end audio signal is marked with a watermark loading enable flag bit, a watermark source coding frame needs to be added, through a watermark modulation algorithm, to the audio signal at the current coding position to obtain a first signal to be coded; the first signal to be coded is then channel coded to obtain a channel coding frame corresponding to the current coding position. When the current coding position in the far-end audio signal is not marked with a watermark loading enable flag bit, the audio signal at the current coding position may be directly used as a second signal to be coded, and channel coding is performed on the second signal to be coded to obtain a channel coding frame corresponding to the current coding position.
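The flag-driven branching just described can be sketched as follows; `modulate` and `encode` are hypothetical callables standing in for the watermark modulation algorithm and the channel coder, which this sketch deliberately does not implement:

```python
def channel_encode(audio_frames, enable_flags, source_frames, modulate, encode):
    """Walk the far-end signal frame by frame. At positions whose watermark
    loading enable flag is set, embed the next watermark source coding
    frame (the first signal to be coded); otherwise pass the frame through
    unchanged (the second signal to be coded). Each result is channel coded."""
    out = []
    pending = list(source_frames)
    for frame, flagged in zip(audio_frames, enable_flags):
        if flagged and pending:
            to_code = modulate(frame, pending.pop(0))
        else:
            to_code = frame
        out.append(encode(to_code))
    return out
```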
Considering the characteristics of the application scenario, a time-domain bidirectional multi-kernel echo-hiding watermark modulation algorithm, which has strong robustness, little damage to sound quality, and low complexity, was finally selected after evaluation and comparison of candidate watermark modulation algorithms. The main principle of the algorithm is to modulate the watermark information into early reflected sound that the human ear cannot distinguish, using the non-simultaneous (temporal) masking mechanism of human hearing. The algorithm uses bidirectional echoes to resist interference caused by spatial multipath reflection, and uses multiple kernels to increase the data transmission rate.
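The core echo-hiding idea can be illustrated with a single forward kernel (the patent's algorithm is bidirectional and multi-kernel, which this sketch does not reproduce); the delays and attenuation are illustrative assumptions:

```python
import numpy as np

def echo_hide_bit(frame, bit, d0=50, d1=80, alpha=0.2):
    """Echo hiding: add a faint delayed copy of the frame as an artificial
    early reflection. Delay d0 (samples) encodes bit 0, delay d1 encodes
    bit 1; alpha keeps the echo below the temporal masking threshold."""
    d = d1 if bit else d0
    echo = np.zeros_like(frame)
    echo[d:] = alpha * frame[:-d]
    return frame + echo
```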
In order to improve the recognition rate of the watermark information analysis end and the robustness to different transmission channels, when performing channel coding, a synchronization code may be added to the frame header and an error correction code may be added to the frame tail of each generated channel coding frame; the frame structure of the channel coding frame may be specifically as shown in fig. 9. The synchronization code is a string of fixed code words used for frame synchronization, and its specific length and content can be adjusted according to the actual condition of the channel. The main function of the error correction code is to reduce the bit error rate at the receiving end when the channel signal-to-noise ratio is poor; the embodiment of the present application may specifically adopt a 31-bit BCH error correction code, which is better suited to short codes.
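Frame synchronization with a fixed sync word amounts to a sliding-window search over the demodulated bit stream; this minimal sketch (function name and sync word are illustrative) shows the idea:

```python
def find_sync(bits, sync):
    """Slide the fixed synchronization code over the bit stream and return
    the index of its first occurrence, i.e. the frame boundary; -1 if the
    sync word is not present."""
    for i in range(len(bits) - len(sync) + 1):
        if bits[i:i + len(sync)] == sync:
            return i
    return -1
```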
The principle of watermark information analysis at the recording end is shown in fig. 12. The method specifically comprises the following four parts:
1) Audio demodulation: according to the frame length of the channel coding frame, the audio signal of the corresponding frame length in the collected near-end audio signal is demodulated to obtain a corresponding binary bit stream.
2) Channel decoding: channel decoding is performed on the binary bit stream obtained by audio demodulation. Frame synchronization is first performed by relying on the synchronization code in the binary bit stream; the error correction code in the binary bit stream is then used to correct bit errors generated during channel transmission. If the error correction succeeds, that is, the number of corrected error bits does not exceed the preset number of error bits, subsequent watermark demodulation processing is performed on the channel decoded stream obtained by channel decoding; if the error correction fails, that is, the number of corrected error bits exceeds the preset number of error bits, the frame of audio data in the near-end audio signal is discarded, and the next frame of audio data in the near-end audio signal is awaited for decoding.
3) Watermark demodulation: the hidden coded bit stream (i.e. the watermark source coded frame) is extracted from the channel decoded stream by using a watermark demodulation algorithm corresponding to the watermark modulation algorithm, for example, if the previously used watermark modulation algorithm is an echo hidden modulation algorithm, then the watermark source coded frame can be demodulated from the channel decoded stream by using a cepstrum method.
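The cepstrum method mentioned here exploits the fact that an echo at delay d produces a peak at quefrency d in the real cepstrum; a minimal bit detector for the single-kernel sketch (the kernel delays d0/d1 are the same illustrative assumptions as before) might be:

```python
import numpy as np

def cepstrum_detect_bit(frame, d0=50, d1=80):
    """Real cepstrum of the frame peaks at the echo delay; comparing the
    cepstrum values at the two kernel delays decides the hidden bit."""
    spectrum = np.fft.fft(frame)
    cep = np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)).real
    return 1 if cep[d1] > cep[d0] else 0
```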
4) Source decoding: source decoding is performed on the watermark source coding frame, and source-side error checking is performed according to the check code in the frame. If the check passes, the content of the watermark source coding frame is parsed to obtain the byte length of the original watermark information, the sequence number of the carried sub-watermark information within the original watermark information, and the content of the sub-watermark information, and the corresponding position in the byte detection result is marked as 1. If the check fails, the frame of audio data in the near-end audio signal is discarded, and the next frame of audio data in the near-end audio signal is awaited for decoding.
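A decoder matching the hypothetical layout sketched earlier for the encoder (1-byte total length, 1-byte sequence number, content, 1-byte truncated CRC; the real layout is that of fig. 8) would verify the check code before parsing:

```python
import struct
import zlib

def source_decode(frame: bytes):
    """Verify the trailing check byte; on success return
    (total byte length, sequence number, sub-watermark content),
    otherwise None so the caller can discard the frame."""
    body, crc = frame[:-1], frame[-1]
    if (zlib.crc32(body) & 0xFF) != crc:
        return None   # source-side check failed: discard this frame
    total_len, seq = struct.unpack("BB", body[:2])
    return total_len, seq, body[2:]
```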
For the above described echo delay determination method, the present application also provides a corresponding echo delay determination device, so that the above echo delay determination method can be applied and implemented in practice.
Referring to fig. 13, fig. 13 is a schematic structural diagram of an echo delay determining apparatus 1300 corresponding to the echo delay determining method shown in fig. 6. As shown in fig. 13, the echo delay determination device 1300 includes:
the watermark embedding module 1301 is configured to embed watermark information in a reference audio signal to be played to obtain a target audio signal;
an audio playing module 1302, configured to play the target audio signal;
the audio acquisition module 1303 is used for acquiring a near-end audio signal;
a watermark analyzing module 1304, configured to perform watermark information analysis processing on the near-end audio signal;
an echo delay determining module 1305, configured to determine an echo delay according to a position of the watermark information in the target audio signal and a position of the watermark information in the near-end audio signal when the watermark information is analyzed from the near-end audio signal through the watermark information analyzing process.
Optionally, on the basis of the echo delay determination device shown in fig. 13, referring to fig. 14, fig. 14 is a schematic structural diagram of another echo delay determination device 1400 provided in the embodiment of the present application. As shown in fig. 14, the watermark embedding module 1301 includes:
a source coding submodule 1401, configured to perform source coding on the watermark information to obtain a watermark source coding frame;
an embedded position detection sub-module 1402, configured to detect a target position for embedding watermark information in the reference audio signal;
a channel coding sub-module 1403, configured to perform channel coding on the reference audio signal and the watermark source coding frame based on the target position in the reference audio signal, so as to obtain the target audio signal.
Optionally, on the basis of the echo delay determination apparatus shown in fig. 14, the source coding sub-module 1401 is specifically configured to:
dividing the watermark information by taking a preset byte length as a unit to obtain a plurality of pieces of sub-watermark information;
performing source coding on the sub-watermark information aiming at each sub-watermark information to obtain a watermark source coding frame corresponding to the sub-watermark information; the watermark source coding frame comprises the byte length of the watermark information, the arrangement serial number of the sub-watermark information in the watermark information, the sub-watermark information and a check code.
Optionally, on the basis of the echo delay determination apparatus shown in fig. 14, the embedded position detection sub-module 1402 is specifically configured to:
detecting an energy spectral envelope of the reference audio signal;
and determining the position in the reference audio signal where the energy spectrum envelope exceeds a preset energy threshold as the target position, and marking a watermark loading enable flag bit for the target position.
Optionally, on the basis of the echo delay determining apparatus shown in fig. 14, the channel coding sub-module 1403 is specifically configured to:
judging whether the current coding position of the reference audio signal is the target position;
if so, embedding the watermark information source coding frame in the audio signal at the current coding position in the reference audio signal through a watermark modulation algorithm to obtain a first signal to be coded; carrying out channel coding on the first signal to be coded to obtain a channel coding frame corresponding to the current coding position;
if not, taking the audio signal at the current coding position in the reference audio signal as a second signal to be coded; performing channel coding on the second signal to be coded to obtain a channel coding frame corresponding to the current coding position;
and combining the channel coding frames corresponding to the coding positions in the reference audio signal to obtain the target audio signal.
Optionally, on the basis of the echo delay determining apparatus shown in fig. 14, a frame header of the channel coding frame is used for carrying a synchronization code; the frame body of the channel coding frame is used for bearing a data packet; if the channel coding frame corresponds to the target position, the data packet comprises the first signal to be coded; if the channel coding frame does not correspond to the target position, the data packet comprises the second signal to be coded; and the frame tail of the channel coding frame is used for carrying an error correcting code, the error correcting code being generated according to the information carried by the frame header and the frame body of the channel coding frame.
Optionally, on the basis of the echo delay determination apparatus shown in fig. 13, referring to fig. 15, fig. 15 is a schematic structural diagram of another echo delay determination apparatus 1500 provided in the embodiment of the present application. As shown in fig. 15, the watermark parsing module 1304 includes:
the audio demodulation submodule 1501 is configured to demodulate the near-end audio signal to obtain a binary bit stream;
a channel decoding submodule 1502 configured to perform channel decoding on the binary bit stream to obtain a channel decoded stream;
a watermark demodulation sub-module 1503, configured to perform watermark demodulation processing on the channel decoded stream through a watermark demodulation algorithm;
the source decoding submodule 1504 is configured to perform source decoding on the watermark source encoded frame to obtain the watermark information, when the watermark source encoded frame is demodulated from the channel decoded stream through the watermark demodulation processing.
Optionally, on the basis of the echo delay determining apparatus shown in fig. 15, in a case that the channel-coded frame in the target audio signal includes a synchronization code and an error correction code, the channel decoding sub-module 1502 is specifically configured to:
performing frame synchronization based on the synchronization code in the binary bit stream;
in the process of channel decoding the binary bit stream, correcting the error code in the channel decoded stream based on the error correcting code in the binary bit stream;
if the corrected bit error number does not exceed the preset bit error number, executing the watermark demodulation algorithm to perform watermark demodulation processing on the channel decoding stream; and if the corrected bit error number exceeds the preset bit error number, discarding the channel decoding stream.
Optionally, on the basis of the echo delay determining apparatus shown in fig. 15, the source decoding sub-module 1504 is specifically configured to:
checking the watermark information source coding frame according to a check code in the watermark information source coding frame;
if the verification is passed, acquiring the watermark information from the watermark information source coding frame; and if the check is not passed, discarding the watermark source coding frame.
Optionally, on the basis of the echo delay determination apparatus shown in fig. 13, the echo delay determination module 1305 is specifically configured to:
determining a time point of playing the audio frame embedded with the watermark information in the target audio signal through an audio playing channel as a first time point;
determining, as a second time point, a time point at which the audio frame including the watermark information in the near-end audio signal is acquired through an audio acquisition channel;
and calculating the time difference between the second time point and the first time point to obtain the echo delay.
Optionally, on the basis of the echo delay determination apparatus shown in fig. 13, the echo delay determination module 1305 is specifically configured to:
determining a first time length according to the number of the audio frame embedded with the watermark information in the target audio signal and the first frame time length; the first frame duration is a length of time of each audio frame played; the first time length is used for representing the time interval length between the time of playing the audio frame embedded with the watermark information through the audio playing channel and the initial playing time of the audio signal;
determining a second time length according to the number of the audio frame including the watermark information in the near-end audio signal and the second frame time length; the second frame duration is a length of time of each captured audio frame; the second time length is used for representing a time interval between the time of acquiring the audio frame comprising the watermark information through the audio acquisition channel and the initial acquisition time of the audio signal; the initial playing time of the audio signal is the same as the initial acquisition time of the audio signal;
and calculating the difference value between the second time length and the first time length to obtain the echo delay.
Optionally, on the basis of the echo delay determination apparatus shown in fig. 13, referring to fig. 16, fig. 16 is a schematic structural diagram of another echo delay determination apparatus 1600 provided in the embodiment of the present application. As shown in fig. 16, the apparatus further includes:
an echo filtering module 1601 configured to align the near-end audio signal and the target audio signal based on the echo delay; and based on the target audio signal, performing adaptive filtering processing and nonlinear processing on the near-end audio signal to eliminate echo in the near-end audio signal.
The above echo delay determining apparatus applies audio watermarking technology to determine the echo delay. Based on the auditory masking mechanism of the human ear, watermark information is embedded in the reference audio signal to obtain the target audio signal without affecting the audio playing quality or being perceived by the human ear. Because the path from audio playing to echo acquisition is a closed loop, the near-end audio signal collected while the target audio signal is played should also include the watermark information, and the echo delay can then be determined according to the positions of the watermark information in the target audio signal and in the near-end audio signal. On one hand, since the echo delay is determined based on the watermark information in the near-end audio signal and no special requirement is imposed on the signal-to-noise ratio of the near-end audio signal, the echo delay can be accurately determined by the apparatus provided in the embodiment of the present application even when various audio signals are mixed in the near-end audio signal. On another hand, a device running the apparatus provided in the embodiment of the present application does not need to consume a large amount of computing resources to accurately determine the echo delay. On yet another hand, the apparatus provided in the embodiment of the present application has good compatibility and universality across different hardware devices and software applications; that is, for different hardware devices and software applications, the echo delay can be accurately determined by the apparatus provided in the embodiment of the present application.
The embodiment of the present application further provides a device for determining echo delay, where the device may specifically be a terminal device, and the terminal device provided in the embodiment of the present application will be described below from the perspective of hardware implementation.
Referring to fig. 17, fig. 17 is a schematic structural diagram of a terminal device provided in an embodiment of the present application. As shown in fig. 17, for convenience of explanation, only the portions related to the embodiments of the present application are shown; for undisclosed technical details, please refer to the method portion of the embodiments of the present application. The terminal device may be any terminal device including a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The following takes a smartphone as an example:
fig. 17 is a block diagram illustrating a partial structure of a smartphone related to a terminal provided in an embodiment of the present application. Referring to fig. 17, the smart phone includes: radio Frequency (RF) circuit 1710, memory 1720, input unit 1730, display unit 1740, sensor 1750, audio circuit 1760, wireless fidelity (WiFi) module 1770, processor 1780, and power supply 1790. Those skilled in the art will appreciate that the smartphone configuration shown in fig. 17 is not intended to be limiting, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
The memory 1720 may be used to store software programs and modules, and the processor 1780 executes various functional applications and data processing of the smartphone by running the software programs and modules stored in the memory 1720. The memory 1720 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, and the like), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the smartphone, and the like. Further, the memory 1720 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The processor 1780 is the control center of the smartphone; it connects the various parts of the entire smartphone using various interfaces and lines, and performs the various functions of the smartphone and processes data by running or executing the software programs and/or modules stored in the memory 1720 and calling the data stored in the memory 1720, thereby monitoring the smartphone as a whole. Optionally, the processor 1780 may include one or more processing units; preferably, the processor 1780 may integrate an application processor, which primarily handles the operating system, user interfaces, application programs, and the like, and a modem processor, which primarily handles wireless communication. It will be appreciated that the modem processor described above may also not be integrated into the processor 1780.
In the embodiment of the present application, the processor 1780 included in the terminal further has the following functions:
embedding watermark information in a reference audio signal to be played to obtain a target audio signal;
playing the target audio signal; and collecting near-end audio signals;
carrying out watermark information analysis processing on the near-end audio signal;
and under the condition that the watermark information is analyzed from the near-end audio signal through the watermark information analysis processing, determining echo delay according to the position of the watermark information in the target audio signal and the position of the watermark information in the near-end audio signal.
Optionally, the processor 1780 is further configured to execute the steps of any implementation manner of the echo delay determination method provided in the embodiment of the present application.
The embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program is configured to execute any one implementation of the echo delay determination method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes any one of the implementation manners of the echo delay determination method described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that in essence contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing a computer program, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc.
It should be understood that in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of one or more of the items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be singular or plural.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A method for echo delay determination, the method comprising:
embedding watermark information in a reference audio signal to be played to obtain a target audio signal;
playing the target audio signal, and collecting a near-end audio signal;
carrying out watermark information analysis processing on the near-end audio signal;
and under the condition that the watermark information is analyzed from the near-end audio signal through the watermark information analysis processing, determining echo delay according to the position of the watermark information in the target audio signal and the position of the watermark information in the near-end audio signal.
2. The method of claim 1, wherein embedding watermark information in the reference audio signal to be played to obtain the target audio signal comprises:
carrying out source coding on the watermark information to obtain a watermark source coding frame;
detecting a target position for embedding watermark information in the reference audio signal;
and performing channel coding on the reference audio signal and the watermark source coding frame based on the target position in the reference audio signal to obtain the target audio signal.
3. The method of claim 2, wherein source coding the watermark information to obtain a watermark source coded frame comprises:
dividing the watermark information by taking a preset byte length as a unit to obtain a plurality of pieces of sub-watermark information;
performing, for each piece of sub-watermark information, source coding on the sub-watermark information to obtain a watermark source coding frame corresponding to the sub-watermark information; the watermark source coding frame includes the byte length of the watermark information, the sequence number of the sub-watermark information within the watermark information, the sub-watermark information, and a check code.
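A minimal sketch of this frame layout, assuming a 4-byte chunk size and a simple byte-sum check code (the patent specifies neither):

```python
def source_encode(watermark: bytes, chunk: int = 4):
    """Wrap each fixed-size piece of the watermark in a frame laid out as
    [total byte length | sequence number | sub-watermark | check code]."""
    frames = []
    pieces = [watermark[i:i + chunk] for i in range(0, len(watermark), chunk)]
    for seq, piece in enumerate(pieces):
        body = bytes([len(watermark), seq]) + piece
        check = sum(body) & 0xFF   # assumed byte-sum check code
        frames.append(body + bytes([check]))
    return frames

frames = source_encode(b"DELAY-01")
print(len(frames))      # two frames of 4 payload bytes each
print(frames[0][2:6])
```

Carrying the total length and sequence number in every frame lets the receiver reassemble the watermark even when some frames arrive out of order or are discarded.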
4. The method of claim 2, wherein the detecting the target location for embedding watermark information in the reference audio signal comprises:
detecting an energy spectral envelope of the reference audio signal;
and determining, as the target position, a position in the reference audio signal at which the energy spectral envelope exceeds a preset energy threshold, and setting a watermark-loading enable flag bit at the target position.
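The envelope detection of claim 4 can be illustrated with a per-frame energy gate; the 160-sample frame length and the threshold value below are arbitrary choices for the sketch:

```python
def detect_embed_positions(samples, frame_len=160, threshold=0.01):
    """Flag frames whose mean energy exceeds the threshold as candidate
    watermark-embedding positions (claim 4's enable flag, in effect)."""
    positions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            positions.append(start // frame_len)   # index of an "enabled" frame
    return positions

quiet = [0.001] * 160   # below threshold: a watermark here would be audible
loud = [0.5] * 160      # above threshold: signal energy masks the watermark
print(detect_embed_positions(quiet + loud + quiet))
```

Gating on energy exploits auditory masking: the watermark is only added where louder content can hide it.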
5. The method according to any of claims 2 to 4, wherein said channel coding the reference audio signal and the watermark source coding frame based on the target position in the reference audio signal to obtain the target audio signal comprises:
judging whether the current coding position of the reference audio signal is the target position;
if so, embedding the watermark source coding frame into the audio signal at the current coding position in the reference audio signal through a watermark modulation algorithm to obtain a first signal to be coded; and performing channel coding on the first signal to be coded to obtain a channel coding frame corresponding to the current coding position;
if not, taking the audio signal at the current coding position in the reference audio signal as a second signal to be coded; and performing channel coding on the second signal to be coded to obtain a channel coding frame corresponding to the current coding position;
and combining the channel coding frames corresponding to the coding positions in the reference audio signal to obtain the target audio signal.
6. The method of claim 5, wherein a frame header of the channel coding frame is used to carry a synchronization code;
a frame body of the channel coding frame is used to carry a data packet; if the channel coding frame corresponds to the target position, the data packet includes the first signal to be coded; if the channel coding frame does not correspond to the target position, the data packet includes the second signal to be coded;
and the channel coding frame further carries an error correction code, the error correction code being generated according to the information carried by the frame header and the frame body of the channel coding frame.
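The channel frame of claim 6 (synchronization-code header, data-packet body, and an error correction field computed over header and body together) might be assembled as follows; the 2-byte sync word and the single XOR parity byte are placeholders for whatever real codes an implementation would choose:

```python
SYNC = b"\xaa\x55"   # hypothetical 2-byte synchronization code

def parity_byte(data: bytes) -> int:
    """Toy stand-in for the error correction code: XOR of all bytes."""
    p = 0
    for b in data:
        p ^= b
    return p

def channel_encode(packet: bytes) -> bytes:
    """Frame = sync code (header) + data packet (body) + check byte,
    with the check computed over header and body together."""
    frame = SYNC + packet
    return frame + bytes([parity_byte(frame)])

frame = channel_encode(b"\x08\x00DELA\x1e")   # e.g. a source-coded watermark frame
print(frame[:2] == SYNC)
print(frame[-1] == parity_byte(frame[:-1]))
```

The fixed sync word gives the decoder a frame boundary to lock onto in the demodulated bit stream before any payload is interpreted.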
7. The method according to claim 1, wherein said performing watermark information parsing on the near-end audio signal comprises:
demodulating the near-end audio signal to obtain a binary bit stream;
performing channel decoding on the binary bit stream to obtain a channel decoded stream;
performing watermark demodulation processing on the channel decoded stream through a watermark demodulation algorithm;
and, in the case that the watermark source coding frame is demodulated from the channel decoded stream through the watermark demodulation processing, performing source decoding on the watermark source coding frame to obtain the watermark information.
8. The method of claim 7, wherein in the case that the channel-coded frame in the target audio signal comprises a synchronization code and an error correction code, the channel-decoding the binary bitstream to obtain a channel-decoded stream comprises:
performing frame synchronization based on the synchronization code in the binary bit stream;
in the process of channel decoding the binary bit stream, correcting bit errors in the channel decoded stream based on the error correction code in the binary bit stream;
if the number of corrected bit errors does not exceed a preset bit error number, performing watermark demodulation processing on the channel decoded stream through the watermark demodulation algorithm; and if the number of corrected bit errors exceeds the preset bit error number, discarding the channel decoded stream.
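The patent does not name the error correction code; a Hamming(7,4) single-error-correcting code is enough to illustrate the correct-then-count-then-discard logic of claim 8:

```python
def hamming74_encode(nibble):
    """Pack 4 data bits into a 7-bit codeword (bit order: p1 p2 d0 p3 d1 d2 d3)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1, p2, p3 = d[0] ^ d[1] ^ d[3], d[0] ^ d[2] ^ d[3], d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))

def hamming74_decode(word):
    """Return (nibble, corrected_bit_count); corrects at most one bit error."""
    bits = [(word >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)   # syndrome = 1-based error position
    corrected = 0
    if syndrome:
        bits[syndrome - 1] ^= 1
        corrected = 1
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, corrected

def decode_words(words, max_bit_errors=2):
    """Decode a frame; discard it (return None) if too many bits were corrected."""
    nibbles, total = [], 0
    for w in words:
        n, c = hamming74_decode(w)
        nibbles.append(n)
        total += c
    return nibbles if total <= max_bit_errors else None

word = hamming74_encode(0b1010)
nibble, corrected = hamming74_decode(word ^ 0b0000100)   # inject one bit error
print(nibble == 0b1010, corrected)
```

Counting corrected bits gives a cheap confidence measure: a frame that needed many corrections is likely too damaged to trust, so it is dropped rather than passed to watermark demodulation.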
9. The method as claimed in claim 7 or 8, wherein said source decoding said watermarked source encoded frame to obtain said watermark information comprises:
checking the watermark source coding frame according to a check code in the watermark source coding frame;
if the check is passed, acquiring the watermark information from the watermark source coding frame; and if the check fails, discarding the watermark source coding frame.
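The verify-or-discard step of claim 9, again assuming a byte-sum check code for illustration:

```python
def source_decode(frame: bytes):
    """Verify the trailing check code (assumed byte-sum here); return the
    decoded fields on success, or None (frame discarded) on failure."""
    body, check = frame[:-1], frame[-1]
    if (sum(body) & 0xFF) != check:
        return None
    total_len, seq, piece = body[0], body[1], body[2:]
    return total_len, seq, piece

body = bytes([8, 0]) + b"DELA"
frame = body + bytes([sum(body) & 0xFF])
print(source_decode(frame))
print(source_decode(frame[:-1] + b"\x00"))   # corrupted check code: discarded
```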
10. The method of claim 1, wherein determining an echo delay based on the location of the watermark information in the target audio signal and the location of the watermark information in the near-end audio signal comprises:
determining a time point of playing the audio frame embedded with the watermark information in the target audio signal through an audio playing channel as a first time point;
determining, as a second time point, a time point at which an audio frame including the watermark information is acquired through an audio acquisition channel;
and calculating the time difference between the second time point and the first time point to obtain the echo delay.
11. The method of claim 1, wherein determining an echo delay based on the location of the watermark information in the target audio signal and the location of the watermark information in the near-end audio signal comprises:
determining a first duration according to the sequence number, in the target audio signal, of the audio frame embedded with the watermark information and a first frame duration; the first frame duration is the duration of each played audio frame; the first duration represents the length of the time interval between the moment at which the audio frame embedded with the watermark information is played through the audio playing channel and the initial playing moment of the audio signal;
determining a second duration according to the sequence number, in the near-end audio signal, of the audio frame including the watermark information and a second frame duration; the second frame duration is the duration of each acquired audio frame; the second duration represents the length of the time interval between the moment at which the audio frame including the watermark information is acquired through the audio acquisition channel and the initial acquisition moment of the audio signal; the initial playing moment of the audio signal is the same as the initial acquisition moment of the audio signal;
and calculating the difference between the second duration and the first duration to obtain the echo delay.
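Under claim 11, each watermark position is recovered as a frame number multiplied by the corresponding per-frame duration, with playback and acquisition assumed to start at the same instant (claim 10 is the same subtraction applied directly to the two time points); the numbers below are hypothetical:

```python
def echo_delay_ms(play_frame_no, play_frame_ms, cap_frame_no, cap_frame_ms):
    """Echo delay = (capture frame no. x capture frame duration)
    - (playback frame no. x playback frame duration)."""
    first_duration = play_frame_no * play_frame_ms    # watermark offset in playback
    second_duration = cap_frame_no * cap_frame_ms     # watermark offset in capture
    return second_duration - first_duration

# Hypothetical numbers: watermark played in frame 50 of 20 ms frames,
# captured in frame 110 of 10 ms frames.
print(echo_delay_ms(50, 20, 110, 10))
```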
12. The method of claim 1, further comprising:
aligning the near-end audio signal and the target audio signal based on the echo delay;
and based on the target audio signal, performing adaptive filtering processing and nonlinear processing on the near-end audio signal to eliminate echo in the near-end audio signal.
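The patent does not specify the adaptive filter; a normalized least-mean-squares (NLMS) filter, a common choice for acoustic echo cancellation, can sketch the align-then-filter step of claim 12:

```python
import random

def nlms_echo_cancel(far, near, delay, taps=8, mu=0.5, eps=1e-6):
    """Align the near-end signal with the far-end reference using the
    measured echo delay, then subtract an NLMS-estimated echo."""
    w = [0.0] * taps
    out = []
    for n in range(len(near)):
        # far-end reference samples, shifted by the measured echo delay
        x = [far[n - delay - k] if 0 <= n - delay - k < len(far) else 0.0
             for k in range(taps)]
        y = sum(wi * xi for wi, xi in zip(w, x))    # estimated echo
        e = near[n] - y                             # echo-cancelled sample
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out

# Synthetic check: near-end is a pure echo (gain 0.8, 5-sample delay) of far-end.
rng = random.Random(1)
far = [rng.uniform(-1, 1) for _ in range(2000)]
near = [0.8 * far[n - 5] if n >= 5 else 0.0 for n in range(2000)]
residual = nlms_echo_cancel(far, near, delay=5)
print(sum(v * v for v in residual[-500:]) < 0.01 * sum(v * v for v in near[-500:]))
```

An accurate delay estimate matters because the filter's few taps must span the true echo path; without alignment, the echo falls outside the filter's window and cannot be cancelled.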
13. An echo delay determination device, the device comprising:
the watermark embedding module is used for embedding watermark information into a reference audio signal to be played to obtain a target audio signal;
the audio playing module is used for playing the target audio signal;
the audio acquisition module is used for acquiring a near-end audio signal;
the watermark analyzing module is used for analyzing the watermark information of the near-end audio signal;
an echo delay determining module, configured to determine an echo delay according to a position of the watermark information in the target audio signal and a position of the watermark information in the near-end audio signal when the watermark information is analyzed from the near-end audio signal through the watermark information analysis processing.
14. An electronic device, comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to perform the echo delay determination method of any of claims 1 to 12 in accordance with the computer program.
15. A computer-readable storage medium for storing a computer program for executing the echo delay determination method of any one of claims 1 to 12.
CN202110246487.1A 2021-03-05 2021-03-05 Echo delay determination method, device, equipment and storage medium Pending CN113707160A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110246487.1A CN113707160A (en) 2021-03-05 2021-03-05 Echo delay determination method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113707160A true CN113707160A (en) 2021-11-26

Family

ID=78647808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110246487.1A Pending CN113707160A (en) 2021-03-05 2021-03-05 Echo delay determination method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113707160A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339284A (en) * 2021-12-27 2022-04-12 北京京东拓先科技有限公司 Method, device, storage medium and program product for monitoring live broadcast delay

Similar Documents

Publication Publication Date Title
CN101918999B (en) Methods and apparatus to perform audio watermarking and watermark detection and extraction
KR101646586B1 (en) Sending device
US11715171B2 (en) Detecting watermark modifications
CN103299366B (en) Devices for encoding and detecting a watermarked signal
US9407869B2 (en) Systems and methods for initiating conferences using external devices
US11683103B2 (en) Method and system for acoustic communication of data
CN101583996A (en) A method and noise suppression circuit incorporating a plurality of noise suppression techniques
Faundez-Zanuy et al. Speaker verification security improvement by means of speech watermarking
CN109524004B (en) Method for realizing parallel transmission of multi-channel audio and data, external voice interaction device and system
CN103299365A (en) Devices for adaptively encoding and decoding a watermarked signal
US10204634B2 (en) Distributed suppression or enhancement of audio features
CN109584890A (en) Audio frequency watermark insertion, extraction, television program interaction method and device
CN113707160A (en) Echo delay determination method, device, equipment and storage medium
JP2003316670A (en) Method, program and device for concealing error
CN111199745A (en) Advertisement identification method, equipment, media platform, terminal, server and medium
CN112367125A (en) Information transmission method, information transmission device, communication equipment and computer readable storage medium
CN117118956B (en) Audio processing method, device, electronic equipment and computer readable storage medium
Szwoch et al. A double-talk detector using audio watermarking
US20240007566A1 (en) Method and system for acoustic communication of data
CN113114417B (en) Audio transmission method and device, electronic equipment and storage medium
US20150327035A1 (en) Far-end context dependent pre-processing
Modegi Construction of ubiquitous acoustic spaces using audio watermark technology and mobile terminals
CN116996489A (en) Screen-throwing code transmission method, screen-throwing device and screen-throwing equipment
CN116709115A (en) Broadcast audio playing method special for square dance
CN116074440A (en) Call state detection method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination