CN113241086A - Audio processing method and device, electronic equipment and storage medium

Info

Publication number
CN113241086A
CN113241086A (application CN202110529530.5A); granted publication CN113241086B
Authority
CN
China
Prior art keywords
audio data
state
reset
aec
delay jitter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110529530.5A
Other languages
Chinese (zh)
Other versions
CN113241086B (en)
Inventor
陈翔宇
邢文浩
张晨
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110529530.5A
Publication of CN113241086A
Application granted
Publication of CN113241086B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L2021/02082 - Noise filtering, the noise being echo or reverberation of the speech
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M9/00 - Arrangements for interconnection not involving centralised switching
    • H04M9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M9/082 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic, using echo cancellers

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

The present disclosure relates to an audio processing method, an audio processing apparatus, an electronic device, and a storage medium. The audio processing method includes: detecting whether delay jitter exists during audio data capture and playback; resetting the state of the acoustic echo cancellation (AEC) system when delay jitter is detected; and performing acoustic echo cancellation with the state-reset AEC system, wherein the state-reset AEC system cancels acoustic echo to a greater degree than the AEC system before the state reset.

Description

Audio processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of signal processing, and in particular, to an audio processing method and apparatus, an electronic device, and a storage medium.
Background
Call-scenario applications (e.g., conferencing applications, live co-streaming, in-game voice chat) often use AEC systems to cancel the far-end speech and retain the near-end speech. On the various platforms (mobile handsets, PCs, macOS, etc.), delay jitter typically occurs during AEC operation. This jitter is often caused by non-uniform capture and playback callback timing or uneven system thread scheduling. However, the AEC system cannot sense this delay jitter and can only rely on the convergence of its internal acoustic model (e.g., an adaptive filter); while the model re-converges, echo leaks.
Disclosure of Invention
The present disclosure provides an audio processing method, an audio processing apparatus, an electronic device, and a storage medium, so as to at least solve the echo leakage problem that occurs in the related art when delay jitter is present.
According to a first aspect of embodiments of the present disclosure, there is provided an audio processing method, including: detecting whether delay jitter exists during audio data capture and playback; resetting the state of the acoustic echo cancellation AEC system when delay jitter is detected; and performing acoustic echo cancellation with the state-reset AEC system, wherein the state-reset AEC system cancels acoustic echo to a greater degree than the AEC system before the state reset.
Optionally, the detecting whether there is a time delay jitter in the audio data acquisition and playing process includes:
marking the collected first audio data and the played second audio data by using a system time stamp, and detecting whether delay jitter exists in the audio data collection process according to the time stamps of the first audio data and the second audio data; and caching the first audio data under the condition that the second audio data is not obtained, and detecting whether delay jitter exists in the audio data playing process according to the caching state of the first audio data.
Optionally, the determining whether there is delay jitter in the audio data acquisition process according to the time stamps of the first audio data and the second audio data includes: and when the time stamp of the second audio data is not larger than the time stamp of the first audio data, determining that the time delay jitter exists in the audio data acquisition process.
Optionally, the detecting whether there is a delay jitter in the audio data playing process according to the buffer status of the first audio data includes: and when the amount of the buffered first audio data exceeds a preset buffer threshold value, determining that time delay jitter exists in the audio data playing process.
Optionally, resetting the state of the acoustic echo cancellation AEC system when the presence of delay jitter is detected comprises: setting an AEC reset flag when the presence of delay jitter is detected; and resetting the state of the AEC system when the timestamp of the second audio data is greater than the timestamp of the first audio data and the AEC reset flag is set.
Optionally, the resetting the state of the acoustic echo cancellation AEC system comprises: bringing an acoustic model used for acoustic echo cancellation in the AEC system into a first state, wherein the acoustic model in the AEC system before the state reset is in a second state, and wherein the acoustic model cancels acoustic echo to a greater extent in the first state than in the second state.
Optionally, the performing acoustic echo cancellation processing with the AEC system after state reset includes: acoustic echo of the played audio data is cancelled from the captured audio data using an acoustic model in the AEC system in a first state.
According to a second aspect of the embodiments of the present disclosure, there is provided an audio processing apparatus including: the detection unit is configured to detect whether time delay jitter exists in the audio data acquisition and playing process; a state reset unit configured to reset a state of the acoustic echo cancellation AEC system when presence of the delay jitter is detected; an echo cancellation unit configured to perform an acoustic echo cancellation process using the AEC system after the state reset, wherein the AEC system after the state reset cancels acoustic echoes to a greater degree than the AEC system before the state reset.
Optionally, the detecting whether there is a time delay jitter in the audio data acquisition and playing process includes: marking the collected first audio data and the played second audio data by using a system time stamp, and detecting whether delay jitter exists in the audio data collection process according to the time stamps of the first audio data and the second audio data; and caching the first audio data under the condition that the second audio data is not obtained, and detecting whether delay jitter exists in the audio data playing process according to the caching state of the first audio data.
Optionally, the determining whether there is delay jitter in the audio data acquisition process according to the time stamps of the first audio data and the second audio data includes: and when the time stamp of the second audio data is not larger than the time stamp of the first audio data, determining that the time delay jitter exists in the audio data acquisition process.
Optionally, the detecting whether there is a delay jitter in the audio data playing process according to the buffer status of the first audio data includes: and when the amount of the buffered first audio data exceeds a preset buffer threshold value, determining that time delay jitter exists in the audio data playing process.
Optionally, resetting the state of the acoustic echo cancellation AEC system when the presence of delay jitter is detected comprises: setting an AEC reset flag when the presence of delay jitter is detected; and resetting the state of the AEC system when the timestamp of the second audio data is greater than the timestamp of the first audio data and the AEC reset flag is set.
Optionally, the resetting the state of the acoustic echo cancellation AEC system comprises: bringing an acoustic model used for acoustic echo cancellation in the AEC system into a first state, wherein the acoustic model in the AEC system before the state reset is in a second state, and wherein the acoustic model cancels acoustic echo to a greater extent in the first state than in the second state.
Optionally, the performing acoustic echo cancellation processing with the AEC system after state reset includes: acoustic echo of the played audio data is cancelled from the captured audio data using an acoustic model in the AEC system in a first state.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the audio processing method as described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions, characterized in that the instructions, when executed by at least one processor, cause the at least one processor to perform the audio processing method as described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the audio processing method as described above.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects: the embodiments avoid echo leakage by detecting delay jitter, resetting the AEC system state when delay jitter is detected, and performing acoustic echo cancellation processing with the state-reset AEC system. Because the state-reset AEC system cancels acoustic echo more strongly than the AEC system before the reset, echo leakage can be effectively avoided.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is an exemplary system architecture to which exemplary embodiments of the present disclosure may be applied;
fig. 2 is a flowchart of an audio processing method of an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an AEC system of an exemplary embodiment of the present disclosure;
fig. 4 is a schematic diagram illustrating an example of an audio processing method of an exemplary embodiment of the present disclosure;
fig. 5 is a block diagram showing an audio processing apparatus of an exemplary embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any plural ones of the items", and "all of the items". For example, "includes at least one of A and B" covers three parallel cases: (1) includes A; (2) includes B; (3) includes A and B. Likewise, "at least one of step one and step two is performed" covers three parallel cases: (1) step one is performed; (2) step two is performed; (3) both step one and step two are performed.
Fig. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables. A user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or send messages (e.g., video data upload requests, video data download requests). Various communication client applications may be installed on the terminal devices 101, 102, 103, such as audio and video call software, audio and video recording software, instant messaging software, conference software, mailbox clients, and social platform software. The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices having a display screen and capable of playing, recording, and editing audio and video, including but not limited to smartphones, tablet computers, laptop portable computers, and desktop computers. When they are software, they may be installed in the electronic devices listed above and may be implemented either as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The terminal devices 101, 102, 103 may be equipped with an image capturing device (e.g., a camera) to capture video data. In practice, the smallest visual unit that makes up a video is a frame; each frame is a static image, and temporally successive sequences of frames are combined to form a motion video. Further, the terminal devices 101, 102, 103 may also be equipped with a component (e.g., a speaker) for converting an electrical signal into sound for playback, and with a device (e.g., a microphone) for picking up sound and converting it into an audio signal.
The server 105 may be a server providing various services, such as a background server providing support for multimedia applications installed on the terminal devices 101, 102, 103. The background server may analyze, store, and otherwise process received data such as audio and video upload requests; it may also receive audio and video download requests sent by the terminal devices 101, 102, 103 and feed back the requested audio and video data to them.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the audio processing method provided by the embodiment of the present disclosure is generally executed by a terminal device, but may also be executed by a server, or may also be executed by cooperation of the terminal device and the server. Accordingly, the audio processing means may be provided in the terminal device, in the server, or in both the terminal device and the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation, and the disclosure is not limited thereto.
Fig. 2 is a flowchart of an audio processing method of an exemplary embodiment of the present disclosure.
Referring to fig. 2, in step S210, it is detected whether delay jitter exists during audio data capture and playback. For example, in an actual call scenario, whether the near end captures audio data with a microphone or plays audio data from the far end (opposite end) through a speaker, delay jitter may arise from uneven callback data or uneven thread scheduling. When delay jitter is present, the acoustic echo cancellation AEC system cannot sense it and can only rely on the convergence of its internal acoustic model (for example, but not limited to, an adaptive filter). The convergence process takes a certain time, during which the acoustic echo of the played audio data is likely not cancelled or suppressed in a timely and effective manner, so acoustic echo leaks from the near end to the opposite end (i.e., the leaky-echo phenomenon occurs). This degrades the audio heard at the opposite end and affects call quality.
To facilitate a better understanding of the present disclosure, a brief description of the AEC system is first presented below with reference to fig. 3.
As shown in fig. 3, the far-end speech is transmitted to the system, where the system's speaker reproduces the far-end signal while a copy of the far-end speech data is presented to the AEC system. The microphone collects near-end speech and, at the same time, the acoustic echo of the far-end speech played by the loudspeaker; together these constitute the near-end signal. The near-end signal is input to the AEC, which uses its internal acoustic model to obtain an estimate of the acoustic echo of the far-end speech based on the far-end speech and the near-end signal, and then subtracts this estimate from the near-end signal, yielding near-end speech with the acoustic echo of the far-end speech suppressed or cancelled. The system may then transmit the near-end speech to the far end. For example, during a live conversation or a conference call, the AEC eliminates, as far as possible, the far-end speech played by the speaker from the signal collected by the microphone, i.e., it minimizes how much of the played audio signal enters the output signal that the near-end system sends to the far end. Having introduced the operating principle of the AEC system, the description of the audio processing method of the embodiment of the present disclosure continues with reference back to fig. 2.
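The subtract-the-estimate behavior described above can be sketched with a minimal NLMS (normalized least mean squares) adaptive filter. NLMS is a common choice for such an acoustic model, but the patent does not mandate it; the class and parameter names here are illustrative assumptions.

```python
import numpy as np

class NlmsAec:
    """Minimal single-channel NLMS echo canceller (illustrative sketch)."""

    def __init__(self, taps=256, mu=0.5, eps=1e-8):
        self.w = np.zeros(taps)   # estimated echo path
        self.x = np.zeros(taps)   # recent far-end samples, newest first
        self.mu = mu              # adaptation step size
        self.eps = eps            # regularizer to avoid division by zero

    def reset_state(self):
        # The "first state": restart adaptation so the filter re-converges.
        self.w[:] = 0.0
        self.x[:] = 0.0

    def process(self, far_sample, near_sample):
        # Shift in the newest far-end sample.
        self.x = np.roll(self.x, 1)
        self.x[0] = far_sample
        echo_est = self.w @ self.x               # estimated acoustic echo
        err = near_sample - echo_est             # near-end signal after cancellation
        norm = self.x @ self.x + self.eps
        self.w += self.mu * err * self.x / norm  # NLMS weight update
        return err
```

Feeding white-noise far-end audio through a toy echo path, the residual after adaptation is far smaller than the raw echo, which is the convergence behavior the Background section refers to.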
Specifically, in step S210, the collected first audio data and the played second audio data may be marked with system timestamps, and whether delay jitter exists in the audio capture process is detected from the timestamps of the first and second audio data. Here, the collected first audio data may be the audio data captured by the near-end microphone, which includes the near-end user's own speech and the acoustic echo of the far-end audio played by the speaker. The played second audio data may be the audio data from the far end (opposite end) played by the speaker. For example, when the timestamp of the second audio data is not greater than the timestamp of the first audio data, it is determined that delay jitter exists in the audio capture process. The reason is causality: the microphone can only capture the acoustic echo of the far-end audio after the speaker has played it, which holds only when the timestamp of the second audio data is greater than the timestamp of the first audio data. If this causal relationship is not satisfied, delay jitter has likely occurred in the audio capture process.
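The causality check above reduces to a one-line predicate. The function and parameter names are illustrative, not from the patent:

```python
def capture_jitter_detected(capture_ts_ms: int, playback_ts_ms: int) -> bool:
    """Return True when delay jitter is assumed in the capture path.

    Causality requires the played (second) data's timestamp to be strictly
    greater than the captured (first) data's timestamp; when it is not,
    delay jitter is assumed to have occurred during capture.
    """
    return playback_ts_ms <= capture_ts_ms
```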
In addition, in step S210, the first audio data may be buffered when the second audio data has not been acquired, and whether delay jitter exists in the audio playback process may be detected from the buffering state of the first audio data. For example, when the amount of buffered first audio data exceeds a preset buffer threshold, it is determined that delay jitter exists in the audio playback process. This is because exceeding the threshold indicates that no played audio data has been acquired for some time, and delay jitter has likely occurred during audio playback.
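The buffer check can be sketched as follows, assuming a frame-count threshold; the patent leaves the threshold's value and units unspecified, so the names and default here are illustrative:

```python
from collections import deque

class CaptureBuffer:
    """Buffers captured frames while no played audio data is available."""

    def __init__(self, threshold_frames=10):   # illustrative default
        self.frames = deque()
        self.threshold_frames = threshold_frames

    def push(self, frame):
        self.frames.append(frame)

    def playback_jitter_detected(self):
        # Jitter in the playback path is assumed once the buffered amount
        # exceeds the preset threshold.
        return len(self.frames) > self.threshold_frames
```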
In step S220, when the presence of delay jitter is detected, the state of the acoustic echo cancellation AEC system is reset. Specifically, when delay jitter is detected, an AEC reset flag may be set; then, when the timestamp of the second audio data is greater than the timestamp of the first audio data and the AEC reset flag is set, the state of the AEC system is reset. In particular, resetting the state of the AEC system may mean bringing the acoustic model used for acoustic echo cancellation in the AEC system into a first state. Here, the acoustic model may be in a second state before the reset, and the acoustic model cancels acoustic echo more strongly in the first state than in the second state; that is, more of the acoustic echo of the played second audio data can be cancelled in the first state. The first state may be one in which the acoustic model re-enters a fast-convergence or initial state with stronger echo cancellation. For example, when the acoustic model is an adaptive filter, its filtering threshold may be set larger so that acoustic echo is filtered more aggressively; although this may cause some loss of the captured near-end audio, it effectively prevents acoustic echo from leaking to the opposite end. The second state may be the stable state of the acoustic model; in this state the model can still cancel or suppress the acoustic echo of the far-end audio data, but the degree of cancellation or suppression may be weaker than in the first state.
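The two-step reset logic (set a flag on jitter, reset only once causality is restored) can be sketched as follows. `reset_fn` stands in for the AEC system's state reset, and all names are illustrative:

```python
class AecResetController:
    """Tracks the AEC reset flag described above (illustrative sketch)."""

    def __init__(self, reset_fn):
        self._reset_fn = reset_fn   # callback that resets the AEC state
        self.reset_flag = False

    def on_jitter_detected(self):
        # Step 1: set the AEC reset flag when delay jitter is detected.
        self.reset_flag = True

    def maybe_reset(self, capture_ts_ms, playback_ts_ms):
        # Step 2: reset only when the played-data timestamp exceeds the
        # captured-data timestamp and a reset is pending.
        if playback_ts_ms > capture_ts_ms and self.reset_flag:
            self._reset_fn()
            self.reset_flag = False
            return True
        return False
```

Deferring the reset until causality is restored means the filter restarts on a frame pair it can actually model, rather than mid-jitter.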
After the state of the AEC system is reset, in step S230, the acoustic echo cancellation process may be performed with the AEC system after the state reset. As described above, the AEC system after the state reset cancels acoustic echoes to a greater degree than the AEC system before the state reset. Specifically, according to an exemplary embodiment, in step S230, the acoustic echo of the played audio data may be cancelled from the captured audio data using the acoustic model in the AEC system in the first state. For example, the acoustic model in the AEC system in the first state may be utilized to estimate acoustic echo of the played audio data, and the acoustic echo of the played audio data may be cancelled from the acquired audio data using the estimated acoustic echo.
The audio processing method according to the embodiment of the present disclosure has been described above with reference to fig. 2 and fig. 3. According to this method, by detecting whether delay jitter exists during audio capture and playback, resetting the AEC system state when delay jitter is detected, and performing acoustic echo cancellation processing with the state-reset AEC system, the echo leakage phenomenon can be effectively avoided.
Fig. 4 is a schematic diagram illustrating an example of an audio processing method of an exemplary embodiment of the present disclosure. In order to facilitate an intuitive understanding of the audio processing method according to the embodiment of the present disclosure described above with reference to fig. 2, an example of the above-described audio processing method is briefly described below with further reference to fig. 4.
As shown in fig. 4, after the microphone collects audio data (hereinafter, the collected data, i.e., the above-mentioned first audio data), it may be determined whether there is played audio data (hereinafter, the played data, i.e., the above-mentioned second audio data), that is, whether the played audio data can be acquired. If the played data cannot be acquired, the collected data is put into a buffer. Next, it may be determined whether the buffered collected data exceeds a preset buffer threshold; if so, delay jitter exists in the audio playback process, and otherwise the processing ends. When the presence of delay jitter is detected, an AEC reset flag may be set, and then the near-end speech (i.e., the speech produced by the near-end user, excluding acoustic echoes of the far-end speech) may be output.
If there is played data, that is, if the played data can be acquired, the collected (or buffered) data and the played data are read, and it is determined whether the played-data timestamp is greater than the collected-data timestamp. Here, the played-data timestamp and the collected-data timestamp were marked earlier, at the time the audio data was played and collected, respectively. If the played-data timestamp is not greater than the collected-data timestamp, delay jitter exists in the audio capture process; in this case the far-end data can be discarded directly, the AEC reset flag is set, and the near-end speech can be output.
If the played-data timestamp is greater than the collected-data timestamp, it is further determined whether the AEC reset flag is set; if so, the state of the AEC system is reset and AEC processing is performed with the state-reset AEC, whereas if the flag is not set, AEC processing is performed directly. After the AEC processing, the near-end speech can be output, and the processing ends. Resetting the state of the AEC system and performing the AEC processing have been described above and are not repeated here.
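The whole Fig. 4 decision flow can be sketched in one function. The frame representation, the threshold, and the interface of the `aec` object (a `reset_state()` method and a `cancel(play_frame, frame)` method) are assumptions for illustration, not the patent's own API:

```python
from collections import deque

BUFFER_THRESHOLD = 10  # illustrative; the patent leaves the threshold unspecified

def process_capture_frame(frame, capture_ts, play_queue, capture_buf, state):
    """One pass of the Fig. 4 decision flow (all names are illustrative).

    `state` is a dict holding 'reset_flag' and 'aec'.
    """
    if not play_queue:                           # no played data available
        capture_buf.append(frame)
        if len(capture_buf) > BUFFER_THRESHOLD:  # playback-path jitter
            state['reset_flag'] = True
        return frame                             # output near-end speech as-is
    play_frame, play_ts = play_queue.popleft()
    if play_ts <= capture_ts:                    # capture-path jitter
        state['reset_flag'] = True               # discard far-end data, set flag
        return frame
    if state['reset_flag']:                      # pending reset, causality OK
        state['aec'].reset_state()
        state['reset_flag'] = False
    return state['aec'].cancel(play_frame, frame)
```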
In the above example of the audio processing method, whether delay jitter exists during audio capture and playback is determined in advance from the collection and playback timestamps and the buffering state; then, by actively resetting the state of the AEC system when delay jitter is present, the AEC enters a state that cancels the far-end echo more strongly, avoiding the echo leakage caused by delay jitter.
Fig. 5 is a block diagram illustrating an audio processing apparatus according to an exemplary embodiment of the present disclosure.
Referring to fig. 5, the audio processing apparatus 500 may include a detection unit 501, a state resetting unit 502, and an echo canceling unit 503. Specifically, the detection unit 501 may detect whether delay jitter exists during audio data capture and playback. The state resetting unit 502 may reset the state of the acoustic echo cancellation AEC system when the presence of delay jitter is detected. The echo canceling unit 503 may perform acoustic echo cancellation processing using the state-reset AEC system. Here, the state-reset AEC system cancels acoustic echoes to a greater degree than the AEC system before the state reset.
Since the audio processing method shown in fig. 2 can be performed by the audio processing apparatus 500 shown in fig. 5, and the detection unit 501, the state resetting unit 502, and the echo canceling unit 503 can respectively perform operations corresponding to step S210, step S220, and step S230 in fig. 2, any relevant details related to the operations performed by the units in fig. 5 can be referred to in the corresponding descriptions related to fig. 2 to 4, and are not repeated here.
Furthermore, it should be noted that although the audio processing apparatus 500 is described above as being divided into units for respectively performing the corresponding processes, it is clear to those skilled in the art that the processes performed by the units described above can also be performed without any specific unit division by the audio processing apparatus 500 or without explicit demarcation between the units. In addition, the audio processing apparatus 500 may further include other units, for example, a storage unit and the like.
Fig. 6 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Referring to fig. 6, the electronic device 600 may include at least one memory 601 and at least one processor 602, the at least one memory storing computer-executable instructions that, when executed by the at least one processor, cause the at least one processor 602 to perform an audio processing method according to an embodiment of the disclosure.
By way of example, the electronic device may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the set of instructions described above. The electronic device need not be a single device, but can be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or jointly. The electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with a local or remote system (e.g., via wireless transmission).
In an electronic device, a processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor may execute instructions or code stored in the memory, which may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory may be integral to the processor, e.g., RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the memory may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the memory.
In addition, the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform an audio processing method according to an exemplary embodiment of the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program.
The instructions or computer program in the computer-readable storage medium described above may run in an environment deployed on computer devices such as a client, a host, a proxy device, or a server. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems such that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there may also be provided a computer program product including computer instructions which, when executed by a processor, implement an audio processing method according to an exemplary embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An audio processing method, comprising:
detecting whether time delay jitter exists in the audio data acquisition and playing process;
resetting the state of the acoustic echo cancellation AEC system when the presence of delay jitter is detected;
performing an acoustic echo cancellation process with the state-reset AEC system, wherein the state-reset AEC system cancels acoustic echoes to a greater degree than the AEC system before the state reset.
2. The audio processing method of claim 1, wherein the detecting whether there is delay jitter in the audio data collection and playing process comprises:
marking the collected first audio data and the played second audio data by using a system time stamp, and detecting whether delay jitter exists in the audio data collection process according to the time stamps of the first audio data and the second audio data; and
caching the first audio data under the condition that the second audio data is not acquired, and detecting whether delay jitter exists in the audio data playing process according to the caching state of the first audio data.
3. The audio processing method of claim 2, wherein the determining whether there is delay jitter in the audio data acquisition process according to the time stamps of the first audio data and the second audio data comprises:
when the time stamp of the second audio data is not larger than the time stamp of the first audio data, determining that delay jitter exists in the audio data acquisition process.
4. The audio processing method according to claim 2, wherein the detecting whether there is delay jitter in the playing process of the audio data according to the buffer status of the first audio data comprises:
when the amount of the buffered first audio data exceeds a preset buffer threshold, determining that delay jitter exists in the audio data playing process.
5. The audio processing method of claim 2, wherein said resetting the state of the acoustic echo cancellation AEC system when the presence of delay jitter is detected comprises:
when the existence of time delay jitter is detected, setting an AEC reset mark;
in the case where the time stamp of the second audio data is greater than the time stamp of the first audio data and there is a set AEC reset flag, the state of the AEC system is reset.
6. The audio processing method of claim 1, wherein the resetting the state of the Acoustic Echo Cancellation (AEC) system comprises: bringing an acoustic model for acoustic echo cancellation in the AEC system into a first state, wherein the acoustic model in the AEC system prior to a state reset is in a second state, wherein the acoustic model cancels acoustic echoes to a greater extent in the first state than in the second state.
7. The audio processing method of claim 6, wherein the performing acoustic echo cancellation processing with the state-reset AEC system comprises:
acoustic echo of the played audio data is cancelled from the captured audio data using an acoustic model in the AEC system in a first state.
8. An audio processing apparatus comprising:
the detection unit is configured to detect whether time delay jitter exists in the audio data acquisition and playing process;
a state reset unit configured to reset a state of the acoustic echo cancellation AEC system when presence of the delay jitter is detected;
an echo cancellation unit configured to perform an acoustic echo cancellation process using the AEC system after the state reset, wherein the AEC system after the state reset cancels acoustic echoes to a greater degree than the AEC system before the state reset.
9. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the audio processing method of any of claims 1 to 7.
10. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the audio processing method of any of claims 1 to 7.
CN202110529530.5A 2021-05-14 2021-05-14 Audio processing method, device, electronic equipment and storage medium Active CN113241086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110529530.5A CN113241086B (en) 2021-05-14 2021-05-14 Audio processing method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113241086A true CN113241086A (en) 2021-08-10
CN113241086B CN113241086B (en) 2023-05-30

Family

ID=77134425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110529530.5A Active CN113241086B (en) 2021-05-14 2021-05-14 Audio processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113241086B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102761468A (en) * 2011-04-26 2012-10-31 中兴通讯股份有限公司 Method and system for adaptive adjustment of voice jitter buffer
US20150332704A1 (en) * 2012-12-20 2015-11-19 Dolby Laboratories Licensing Corporation Method for Controlling Acoustic Echo Cancellation and Audio Processing Apparatus
CN105872156A (en) * 2016-05-25 2016-08-17 腾讯科技(深圳)有限公司 Echo time delay tracking method and device
CN109361828A (en) * 2018-12-17 2019-02-19 北京达佳互联信息技术有限公司 A kind of echo cancel method, device, electronic equipment and storage medium
CN109639532A (en) * 2018-12-29 2019-04-16 贵阳朗玛信息技术股份有限公司 A kind of method and device of Fast Reduction jitter-buffer
CN110956974A (en) * 2019-12-05 2020-04-03 浙江大华技术股份有限公司 Echo cancellation method and related device
CN111683184A (en) * 2020-05-21 2020-09-18 浙江大华技术股份有限公司 Echo signal processing method, echo signal processing device and computer storage medium
CN111724803A (en) * 2020-06-29 2020-09-29 北京达佳互联信息技术有限公司 Audio processing method and device, electronic equipment and storage medium



Similar Documents

Publication Publication Date Title
US10205830B2 (en) Echo cancellation data synchronization control method, terminal, and storage medium
US10522164B2 (en) Method and device for improving audio processing performance
US8786659B2 (en) Device, method and computer program product for responding to media conference deficiencies
KR102580418B1 (en) Acoustic echo cancelling apparatus and method
EP2974205A1 (en) External roundtrip latency measurement for a communication system
JP2018093466A (en) Telephone network and two-way audio generation device for voice quality evaluation of voice codec
CN112702198A (en) Abnormal root cause positioning method and device, electronic equipment and storage medium
CN110534136B (en) Recording method and device
CN114155852A (en) Voice processing method and device, electronic equipment and storage medium
CN113973103A (en) Audio processing method and device, electronic equipment and storage medium
CN112188259B (en) Method and device for audio and video synchronization test and correction and electronic equipment
CN109842590B (en) Processing method and device for survey task and computer readable storage medium
CN113241086B (en) Audio processing method, device, electronic equipment and storage medium
CN110096250B (en) Audio data processing method and device, electronic equipment and storage medium
CN108200303B (en) Voice telephone echo eliminating method, storage medium, electronic equipment and system
CN113192526B (en) Audio processing method and audio processing device
WO2022227625A1 (en) Signal processing method and apparatus
CN111145770B (en) Audio processing method and device
CN110099183B (en) Audio data processing device and method and call equipment
CN113079103A (en) Audio transmission method, audio transmission device, electronic equipment and storage medium
CN114979344A (en) Echo cancellation method, device, equipment and storage medium
CN114760283A (en) Playing parameter configuration method and device
CN114299089A (en) Image processing method, image processing device, electronic equipment and storage medium
CN110138991B (en) Echo cancellation method and device
CN111145792B (en) Audio processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant