CN111580777B - Audio processing method, device, electronic equipment and storage medium - Google Patents

Audio processing method, device, electronic equipment and storage medium

Info

Publication number
CN111580777B
CN111580777B
Authority
CN
China
Prior art keywords
network
audio
duration
dithering
time length
Prior art date
Legal status
Active
Application number
CN202010372908.0A
Other languages
Chinese (zh)
Other versions
CN111580777A (en)
Inventor
邢文浩
张晨
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010372908.0A priority Critical patent/CN111580777B/en
Publication of CN111580777A publication Critical patent/CN111580777A/en
Application granted granted Critical
Publication of CN111580777B publication Critical patent/CN111580777B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/165Management of the audio stream, e.g. setting of volume, audio stream path
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Abstract

The disclosure relates to an audio processing method, an audio processing device, an electronic device and a storage medium, and belongs to the technical field of audio and video. The method comprises the following steps: receiving audio frames sent by a sending end, and buffering the currently received audio frame in a jitter buffer; acquiring a first duration, a second duration and a third duration, wherein the first duration indicates the current network jitter, the second duration indicates the preset maximum network jitter to be handled, and the third duration is related to the maximum storage capacity of the jitter buffer; and accelerating the processing of the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration, wherein the current capacity indicates the amount of audio frame data currently buffered in the jitter buffer. The method and the device can shorten the duration of sound delay as much as possible, improve the communication experience of both communicating parties, and achieve a better audio processing effect.

Description

Audio processing method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of audio and video, and in particular relates to an audio processing method, an audio processing device, electronic equipment and a storage medium.
Background
The host user and guest users can connect by mic-linking (co-hosting) during a live broadcast. During mic-linking, the terminal of each party collects audio through its microphone, and the collected audio is preprocessed, encoded by an audio encoder, and then sent to the terminals of the other parties through the network. After receiving audio frames of the transmitting end from the network, the receiving end temporarily buffers them in a JB (Jitter Buffer) for reordering and decoding; after decoding by the audio decoder, the audio is played through a loudspeaker. In this way, the two parties can hear each other's voices.
However, during mic-linking, on the one hand, network jitter can hardly be avoided because of the actual conditions of the network. When the network jitters, the receiving end may suddenly receive many audio frames at a certain moment, and the resulting accumulation of audio frames causes sound delay. On the other hand, in some special interruption situations, such as an alarm or an incoming call, audio playback is interrupted, but audio frames keep arriving and are placed into the JB one after another, so that the amount of data stored in the JB increases suddenly. After the interruption ends, a large number of audio frames have accumulated in the JB, which may produce a large sound delay and seriously impair communication between the two parties.
In view of the above, there is a need for an audio processing scheme that at least solves the problem that a large sound delay arises when the network jitters or playback is interrupted during live broadcasting, seriously impairing communication between the two parties.
Disclosure of Invention
The disclosure provides an audio processing method, an audio processing device, an electronic device and a storage medium, which can shorten the duration of sound delay as much as possible when the network jitters or playback is interrupted, thereby improving the communication experience of both communicating parties. The technical scheme of the present disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided an audio processing method, including:
receiving audio frames sent by a sending end, and buffering the currently received audio frame in a jitter buffer;
acquiring a first duration, a second duration and a third duration, wherein the first duration indicates the current network jitter, the second duration indicates the preset maximum network jitter to be handled, and the third duration is related to the maximum storage capacity of the jitter buffer;
accelerating the processing of the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration;
wherein the current capacity indicates the amount of audio frame data currently buffered in the jitter buffer.
In one possible implementation, acquiring the first duration includes:
determining a probability distribution of the arrival time interval according to the arrival time interval of each received audio frame;
and determining the first duration according to the probability distribution of the arrival time interval.
In one possible implementation, accelerating the processing of the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration includes:
if the duration corresponding to the current capacity is longer than the first duration and shorter than the second duration, adopting a first processing mode to perform accelerated playback of the audio frames buffered in the jitter buffer;
wherein the first processing mode is to discard one audio frame every M1 audio frames, M1 being a positive integer.
In one possible implementation, accelerating the processing of the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration includes:
if the duration corresponding to the current capacity is longer than the second duration and shorter than the third duration, adopting a second processing mode to perform accelerated playback of the audio frames buffered in the jitter buffer;
wherein the second processing mode is to discard one audio frame every M2 audio frames, M2 being a positive integer smaller than M1.
In one possible implementation, accelerating the processing of the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration includes:
if the duration corresponding to the current capacity is not less than the third duration, adopting a third processing mode to perform accelerated playback of the audio frames buffered in the jitter buffer;
wherein the third processing mode is to discard part of the audio frames buffered in the jitter buffer until the amount of remaining audio frame data in the jitter buffer is reduced to the maximum storage capacity.
In one possible implementation, while receiving the audio frame sent by the sending end, the method further includes: receiving a VAD (Voice Activity Detection) value and an energy value of the audio frame;
the rule for discarding the audio frames buffered in the jitter buffer is as follows:
preferentially discarding audio frames whose VAD value is zero in the jitter buffer;
if the VAD values of at least two audio frames in the jitter buffer are the same, preferentially discarding the audio frame with the smaller energy value.
According to a second aspect of embodiments of the present disclosure, there is provided an audio processing apparatus comprising:
a receiving module configured to receive audio frames sent by a sending end and buffer the currently received audio frame in a jitter buffer;
an acquisition module configured to acquire a first duration, a second duration and a third duration, wherein the first duration indicates the current network jitter, the second duration indicates the preset maximum network jitter to be handled, and the third duration is related to the maximum storage capacity of the jitter buffer;
a processing module configured to accelerate the processing of the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration;
wherein the current capacity indicates the amount of audio frame data currently buffered in the jitter buffer.
In a possible implementation, the acquisition module is further configured to determine a probability distribution of the arrival time interval according to the arrival time interval of each received audio frame, and determine the first duration according to the probability distribution of the arrival time interval.
In a possible implementation, the processing module is further configured to, if the duration corresponding to the current capacity is greater than the first duration and less than the second duration, adopt a first processing mode to perform accelerated playback of the audio frames buffered in the jitter buffer;
wherein the first processing mode is to discard one audio frame every M1 audio frames, M1 being a positive integer.
In a possible implementation, the processing module is further configured to, if the duration corresponding to the current capacity is greater than the second duration and less than the third duration, adopt a second processing mode to perform accelerated playback of the audio frames buffered in the jitter buffer;
wherein the second processing mode is to discard one audio frame every M2 audio frames, M2 being a positive integer smaller than M1.
In a possible implementation, the processing module is further configured to, if the duration corresponding to the current capacity is not less than the third duration, adopt a third processing mode to perform accelerated playback of the audio frames buffered in the jitter buffer;
wherein the third processing mode is to discard part of the audio frames buffered in the jitter buffer until the amount of remaining audio frame data in the jitter buffer is reduced to the maximum storage capacity.
In a possible implementation, the receiving module is further configured to receive the VAD value and the energy value of the audio frame while receiving the audio frame sent by the sending end;
the rule for discarding the audio frames buffered in the jitter buffer is as follows:
preferentially discarding audio frames whose VAD value is zero in the jitter buffer;
if the VAD values of at least two audio frames in the jitter buffer are the same, preferentially discarding the audio frame with the smaller energy value.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio processing method according to the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium, instructions in which, when executed by a processor of an electronic device, enable the electronic device to perform the audio processing method described in the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, instructions in which, when executed by a processor of an electronic device, enable the electronic device to perform the audio processing method as described in the first aspect above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
after receiving the audio frames sent by the sending end, the receiving end not only buffers the currently received audio frame in the jitter buffer, but also acquires a first duration, a second duration and a third duration, wherein the first duration indicates the current network jitter, the second duration indicates the preset maximum network jitter to be handled, and the third duration is related to the maximum storage capacity of the jitter buffer; then, the receiving end adaptively accelerates the processing of the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration, wherein the current capacity indicates the amount of audio frame data currently buffered in the jitter buffer. As described above, the audio processing mode is adaptively adjusted according to the current capacity of the JB, so that even if audio frames accumulate heavily at the receiving end because of network jitter or an interruption, no large sound delay is produced. That is, the scheme shortens the duration of sound delay as much as possible, improves the communication experience of both communicating parties, and achieves a better audio processing effect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is a schematic diagram illustrating an implementation environment involved in an audio processing method according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a method of audio processing according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating a method of audio processing according to an exemplary embodiment.
Fig. 4 is a schematic diagram illustrating an IAT probability distribution according to an example embodiment.
Fig. 5 is a diagram illustrating the queue length of a jitter buffer according to an example embodiment.
Fig. 6 is a block diagram of an audio processing device, according to an example embodiment.
Fig. 7 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The user information referred to in the present disclosure may be information authorized by the user or sufficiently authorized by each party.
Optionally: the environment in which the technical scheme is applied is described. For example, the technical solution involving multi-terminal interaction needs to define the network environment and hardware to which the technical solution is applied before describing the solution in detail.
Before explaining embodiments of the present disclosure in detail, some abbreviations or terms that may be involved in the embodiments of the present disclosure are introduced.
In voice communication, the JB is a very important module at the receiving end and is one of the key factors that determine sound quality. On the one hand, it stores received out-of-order audio packets at their correct positions in the buffer; on the other hand, it buffers received audio packets for some time so that audio playback is smoother, thereby achieving better voice quality.
Inter-packet arrival delay (Inter Arrival Time, IAT), in the disclosed embodiment, refers to the arrival time interval of two consecutive audio packets in milliseconds (ms).
Voice activity detection (Voice Activity Detection, VAD) is a technique used in speech processing to detect the presence or absence of a voice signal. Illustratively, VAD techniques may be used for speech coding and speech recognition. For example, VAD can simplify speech processing and can be used to remove non-speech segments during an audio session; in IP telephony applications it can also avoid encoding and transmitting silence packets, saving computation time and bandwidth.
The following describes an implementation environment related to an audio processing method provided by an embodiment of the present disclosure.
Referring to fig. 1, the implementation environment includes: a first terminal 101, a server 102 and a second terminal 103. Wherein the first terminal 101 and the second terminal 103 communicate data via the server 102.
In the embodiment of the present disclosure, the first terminal 101 and the second terminal 103 are generally mobile terminals.
As one example, mobile terminals include, but are not limited to: smartphones, tablet computers, notebook computers, e-readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, and the like; fixed terminals include, but are not limited to, desktop computers, which is not specifically limited in the embodiments of the present disclosure. By way of example, fig. 1 merely illustrates the case where the first terminal 101 and the second terminal 103 are both smartphones.
The server 102 may provide background services for the first terminal 101 and the second terminal 103. The server 102 may be a single server or a server cluster formed by a plurality of servers, which is not limited in the embodiments of the present disclosure.
Each of the first terminal 101 and the second terminal 103 may act as a transmitting end or a receiving end. Taking the first terminal 101 as the transmitting end and the second terminal 103 as the receiving end as an example, the first terminal 101 sequentially preprocesses the collected audio (e.g., echo cancellation and noise reduction), encodes it with an audio encoder, and then sends it to the second terminal 103 through the network; after receiving the audio packets, the second terminal 103 first places them into the JB for buffering and reordering, then decodes them with an audio decoder, and finally plays them through a loudspeaker. When the second terminal 103 is the transmitting end and the first terminal 101 is the receiving end, the processing flow is similar.
It should be noted that, application scenarios of the audio processing method provided in the embodiments of the present disclosure include, but are not limited to:
Scenario 1: mic-linking during live broadcasting
For such a scenario, a live application supporting a live function is generally installed on the first terminal 101 and the second terminal 103.
As is well known, a host user and guest users can connect by mic-linking during a live broadcast. During mic-linking, the terminal of each party collects audio through its microphone, and the collected audio is preprocessed, encoded by an audio encoder, and then sent to the terminals of the other parties through the network. After receiving audio frames of the transmitting end from the network, the receiving end temporarily buffers them in the JB for reordering and decoding; after decoding by the audio decoder, the audio is played through a loudspeaker. In this way, the two parties can hear each other's voices.
However, during mic-linking, on the one hand, network jitter can hardly be avoided because of the actual conditions of the network. When the network jitters, the receiving end may suddenly receive many audio frames at a certain moment, and the resulting accumulation of audio frames causes sound delay. On the other hand, in some special interruption situations, such as an alarm or an incoming call, audio playback is interrupted, but audio frames keep arriving and are placed into the JB one after another, so that the amount of data stored in the JB increases suddenly. After the interruption ends, a large number of audio frames have accumulated in the JB, which may produce a large sound delay and seriously impair communication between the two parties.
The embodiment of the disclosure provides a tiered JB acceleration scheme, which solves the problem of large sound delay caused by large network jitter or similar interruption events during live-broadcast mic-linking. The tiered acceleration strategy provided by the embodiment of the disclosure can significantly shorten the duration of sound delay and achieve better sound quality.
Scenario 2: in addition to scenario 1 listed above, the audio processing scheme provided by the embodiments of the present disclosure may also be applied to other voice communication scenarios, such as an audio conference scenario, which is not specifically limited in the embodiments of the present disclosure.
The audio processing scheme provided by the embodiments of the present disclosure is explained in detail through the following examples.
Fig. 2 is a flowchart illustrating an audio processing method according to an exemplary embodiment, which may be used in the first terminal or the second terminal shown in fig. 1, that is, if the first terminal is a transmitting end, the second terminal is a receiving end; if the first terminal is a receiving end, the second terminal is a transmitting end, as shown in fig. 2, including the following steps.
In step 201, audio frames sent by a sender are received, and the currently received audio frame is buffered in a jitter buffer.
In step 202, a first duration, a second duration and a third duration are obtained, wherein the first duration indicates the current network jitter, the second duration indicates the preset maximum network jitter to be handled, and the third duration is related to the maximum storage capacity of the jitter buffer.
In step 203, the audio frames buffered in the jitter buffer are processed according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration, wherein the current capacity indicates the amount of audio frame data currently buffered in the jitter buffer.
According to the method provided by the embodiment of the disclosure, after receiving the audio frames sent by the sending end, the receiving end buffers the currently received audio frame in the jitter buffer and obtains a first duration, a second duration and a third duration, wherein the first duration indicates the current network jitter, the second duration indicates the preset maximum network jitter to be handled, and the third duration is related to the maximum storage capacity of the jitter buffer; then, the receiving end adaptively processes the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration, wherein the current capacity indicates the amount of audio frame data currently buffered in the jitter buffer. As described above, the audio processing mode is adaptively adjusted according to the current capacity of the JB, so that even if audio frames accumulate heavily at the receiving end because of network jitter or an interruption, no large sound delay is produced. That is, the scheme shortens the duration of sound delay as much as possible, improves the communication experience of both communicating parties, and achieves a better audio processing effect.
In one possible implementation, acquiring the first duration includes:
determining a probability distribution of the arrival time interval according to the arrival time interval of each received audio frame;
and determining the first duration according to the probability distribution of the arrival time interval.
In one possible implementation, accelerating the processing of the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration includes:
if the duration corresponding to the current capacity is longer than the first duration and shorter than the second duration, adopting a first processing mode to perform accelerated playback of the audio frames buffered in the jitter buffer;
wherein the first processing mode is to discard one audio frame every M1 audio frames, M1 being a positive integer.
In one possible implementation, accelerating the processing of the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration includes:
if the duration corresponding to the current capacity is longer than the second duration and shorter than the third duration, adopting a second processing mode to perform accelerated playback of the audio frames buffered in the jitter buffer;
wherein the second processing mode is to discard one audio frame every M2 audio frames, M2 being a positive integer smaller than M1.
In one possible implementation, accelerating the processing of the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration includes:
if the duration corresponding to the current capacity is not less than the third duration, adopting a third processing mode to perform accelerated playback of the audio frames buffered in the jitter buffer;
wherein the third processing mode is to discard part of the audio frames buffered in the jitter buffer until the amount of remaining audio frame data in the jitter buffer is reduced to the maximum storage capacity.
According to the embodiment of the disclosure, different processing modes are adopted for accelerated playback depending on the current capacity of the jitter buffer, so that the duration of sound delay can be shortened accurately and effectively, and the audio processing achieves a better effect.
In one possible implementation, while receiving the audio frame sent by the sending end, the method further includes: receiving the VAD value and the energy value of the audio frame;
the rule for discarding the audio frames buffered in the jitter buffer is as follows:
preferentially discarding audio frames whose VAD value is zero in the jitter buffer;
if the VAD values of at least two audio frames in the jitter buffer are the same, preferentially discarding the audio frame with the smaller energy value.
According to the embodiment of the disclosure, audio frames with a VAD value of 0 are discarded first, and among frames with the same VAD value, the frame with the smaller energy value is discarded first, so that the audio frames carrying the least audio information are discarded as far as possible, thereby protecting sound quality.
Any combination of the above-mentioned alternative solutions may be used to form alternative embodiments of the present disclosure, which are not described herein.
Fig. 3 is a flowchart illustrating an audio processing method according to an exemplary embodiment, which may be used in the first terminal or the second terminal shown in fig. 1, that is, if the first terminal is a transmitting end, the second terminal is a receiving end; if the first terminal is a receiving end, the second terminal is a transmitting end, as shown in fig. 3, including the following steps.
In step 301, the transmitting end transmits an audio frame, together with the VAD value and the energy value of the audio frame, to the receiving end.
The audio frame sent by the sender may also be referred to herein as an audio packet.
In the embodiment of the disclosure, when the transmitting end transmits an audio frame, it computes the VAD value and the energy value of the audio frame and transmits the audio frame together with its VAD value and energy value to the receiving end.
As one example, the VAD value of an audio frame may be determined in, but not limited to, the following ways:
The VAD algorithm may compare a feature calculated from each audio frame (e.g., short-time energy, spectral entropy, or the spectral-entropy-energy product) with a threshold to determine whether the audio frame contains a speech signal. That is, the short-time energy, spectral entropy, or spectral-entropy-energy product of each audio frame may be used as its VAD value.
Alternatively, acoustic features of each audio frame are extracted and input into a pre-trained deep neural network model, and the model calculates the value of the output node corresponding to the audio frame; the deep neural network model is obtained by training a deep neural network on the acoustic features and labeled data of each audio frame in a training data set. The value of the output node corresponding to the audio frame is then taken as the VAD value of the audio frame. The output layer of the deep neural network model comprises two output nodes, where the value of one output node corresponds to the probability that the audio frame is valid speech, and the value of the other corresponds to the probability that the audio frame is invalid speech.
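As a minimal illustration of the threshold-based approach above, the following Python sketch compares the short-time energy of a frame with a threshold to produce a binary VAD value. The sample normalisation, the threshold value and the function name are illustrative assumptions, not values taken from the disclosure.

```python
def frame_vad_value(samples, energy_threshold=1e-4):
    """Threshold-based VAD value for one audio frame.

    samples: the amplitudes of the frame, assumed normalised to [-1.0, 1.0].
    Returns 1 if the short-time energy exceeds the (illustrative) threshold,
    i.e. the frame is treated as containing speech, and 0 otherwise.
    """
    if not samples:
        return 0
    short_time_energy = sum(s * s for s in samples) / len(samples)
    return 1 if short_time_energy > energy_threshold else 0
```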
As another example, the energy value of an audio frame may be calculated in, but not limited to, the following way:
E(n) = s(1)² + s(2)² + … + s(L)², i.e. the sum of the squares of the sampling-point amplitudes, where n refers to the nth audio frame, L refers to the number of sampling points contained in the nth audio frame, s(i) is the amplitude corresponding to sampling point i, and E(n) is the energy value of the nth audio frame.
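A minimal sketch of this energy calculation, assuming the sum-of-squares form given above; the function name is introduced here for illustration only.

```python
def frame_energy(samples):
    """Energy value E(n) of the n-th audio frame.

    samples: the L sampling-point amplitudes s(1)..s(L) of the frame.
    Implements E(n) as the sum of the squared amplitudes; no normalisation
    by L is applied, matching the formula above.
    """
    return sum(float(s) * float(s) for s in samples)
```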
In step 302, after receiving the audio frame sent by the sender, the receiving end buffers the currently received audio frame in the jitter buffer.
In the embodiments of the present disclosure, after the receiving end receives the audio frame together with its VAD value and energy value, it may store the VAD value and the energy value in the JB in addition to the currently received audio frame, which is not limited in this embodiment.
In step 303, the receiving end obtains a first duration, a second duration and a third duration, wherein the first duration indicates the current network jitter, the second duration indicates the preset maximum network jitter to be handled, and the third duration is related to the maximum storage capacity of the jitter buffer.
Network jitter mainly reflects the stability of the network and is used herein to characterize the degree of variation in sound delay.
The first point to be noted is that the first duration, the second duration and the third duration are all expressed in milliseconds (ms).
In one possible implementation, the receiving end may collect statistics on the arrival time interval (IAT) of each audio packet to obtain the probability distribution of the IAT, and then calculate the current network jitter from that probability distribution; the result is denoted L1. The current network jitter is indicated herein by the first duration, i.e. the first duration is denoted by the symbol L1.
Correspondingly, acquiring the first duration includes: the receiving end determines the probability distribution of the arrival time interval according to the arrival time interval of each received audio frame, and then determines the first duration according to that probability distribution.
For example, assume that 100 IAT values have been collected (in practice there are more), of which 60 are 20 ms, 20 are 40 ms, and 20 are 80 ms; then the probability of 20 ms is 60%, of 40 ms is 20%, and of 80 ms is 20%. If the 80th percentile is chosen to determine the current network jitter, the current network jitter is 40 ms.
Fig. 4 shows the probability distribution of the IAT. In fig. 4, the horizontal axis represents the IAT value, the vertical axis represents the probability value, and the curve reflects the correspondence between the two. The L1 shown in fig. 4 refers to the current network jitter.
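A minimal sketch of this estimation, following the cumulative-probability reading of the IAT distribution used in the worked example above; the 80th-percentile choice and the function name are illustrative assumptions rather than values fixed by the disclosure.

```python
import math

def estimate_network_jitter(iat_values_ms, percentile=80):
    """Estimate the current network jitter L1 from observed IAT samples.

    iat_values_ms: recent inter-arrival times in milliseconds.
    Returns the smallest IAT value whose cumulative share of the samples
    reaches the given percentile, mirroring the worked example above.
    """
    if not iat_values_ms:
        return 0.0
    ordered = sorted(iat_values_ms)
    k = max(math.ceil(len(ordered) * percentile / 100) - 1, 0)
    return float(ordered[k])

# Worked example from the text: 60 IATs of 20 ms, 20 of 40 ms, 20 of 80 ms.
iats = [20] * 60 + [40] * 20 + [80] * 20
print(estimate_network_jitter(iats))  # -> 40.0, i.e. L1 = 40 ms
```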
The second point to be noted is that the second duration is denoted herein by the symbol L2, where L2 is a preset value of the JB and represents the maximum network jitter to be handled; that is, if the network jitter exceeds L2, it is treated as L2.
The third point to be noted is that the third duration is denoted herein by the symbol L3. L3 is related to the preset maximum storage capacity of the JB; that is, L3 is the duration corresponding to the maximum amount of data that the memory allocated to the JB can store. For example, if the maximum storage capacity of the JB is 100 audio frames, then L3 is the total duration corresponding to those 100 audio frames.
The fourth point to be noted is that the magnitude relationship among L1, L2 and L3 is L3 > L2 > L1.
In step 304, the receiving end processes the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration, wherein the current capacity indicates the amount of audio frame data currently buffered in the jitter buffer.
In the embodiment of the disclosure, different tiered acceleration strategies are adopted depending on the current capacity of the JB, so as to shorten the duration of sound delay. The current capacity of the JB may also be referred to as the current actual water level of the JB: the water level is the duration corresponding to the audio frames currently stored in the JB. For example, if the JB currently stores 10 audio frames, the current actual water level of the JB is the total duration corresponding to those 10 audio frames.
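For concreteness, the water level can be computed from the number of buffered frames; the one-line sketch below assumes a fixed frame duration of 20 ms, a typical value for voice codecs rather than one given in the disclosure.

```python
def jb_water_level_ms(num_buffered_frames, frame_duration_ms=20):
    """Current actual water level of the JB: the total duration (in ms)
    of the audio frames currently buffered, e.g. 10 frames -> 200 ms."""
    return num_buffered_frames * frame_duration_ms
```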
Illustratively, the audio frames buffered in the jitter buffer are processed according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration in, but not limited to, the following three ways:
3041. If the duration corresponding to the current capacity is longer than the first duration and shorter than the second duration, a first processing mode is adopted to perform accelerated playback of the audio frames buffered in the jitter buffer.
The first processing mode is to discard one audio frame every M1 audio frames, M1 being a positive integer.
That is, if the actual water level L of the current JB satisfies L1 < L < L2, the embodiment of the present disclosure performs normal accelerated playback using the acceleration interval M1, i.e. 1 audio frame is dropped every M1 audio frames.
3042. If the duration corresponding to the current capacity is longer than the second duration and shorter than the third duration, a second processing mode is adopted to perform accelerated playback of the audio frames buffered in the jitter buffer.
The second processing mode is to discard one audio frame every M2 audio frames, M2 being a positive integer smaller than M1. That is, if the actual water level L of the current JB satisfies L2 < L < L3, the amount of audio frame data accumulated in the JB is large and the sound delay is large, which is most likely caused by an interruption event (such as an alarm or an incoming call ring), so a faster accelerated playback mode needs to be started. Acceleration is then performed using the acceleration interval M2; in other words, 1 audio frame is dropped every M2 audio frames.
3043. If the duration corresponding to the current capacity is not less than the third duration, a third processing mode is adopted to perform accelerated playback of the audio frames buffered in the jitter buffer.
The third processing mode is to discard part of the audio frames buffered in the jitter buffer until the remaining amount of audio frame data in the jitter buffer is reduced to the maximum storage capacity.
For this step, if the actual water level L of the current JB is equal to or greater than L3 when an audio frame arrives, part of the audio frames in the JB are discarded until the remaining amount of audio frame data is reduced to the maximum storage capacity of the JB.
Fig. 5 is a schematic diagram showing the actual JB queue lengths corresponding to steps 3041 to 3043. The L1, L2 and L3 shown in fig. 5 are the first duration, the second duration and the third duration described above.
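The following Python sketch summarizes the tiered selection of steps 3041 to 3043 from the current water level L and the thresholds L1, L2 and L3; the mode labels returned and the function name are illustrative, and the concrete values of M1 and M2 are left to the implementation, as they are not fixed by the disclosure.

```python
def choose_acceleration_mode(level_ms, l1_ms, l2_ms, l3_ms):
    """Pick the tiered acceleration strategy from the current JB water level.

    level_ms: duration corresponding to the currently buffered audio frames.
    l1_ms, l2_ms, l3_ms: the first, second and third durations (L3 > L2 > L1).
    """
    if level_ms >= l3_ms:
        # Third processing mode: drop frames until the remaining data amount
        # falls back to the maximum storage capacity of the JB.
        return "drop_until_max_capacity"
    if l2_ms < level_ms:
        # Second processing mode: drop 1 frame every M2 frames (M2 < M1).
        return "drop_one_every_M2_frames"
    if l1_ms < level_ms:
        # First processing mode: drop 1 frame every M1 frames.
        return "drop_one_every_M1_frames"
    # Water level within normal jitter: play back normally.
    return "play_normally"
```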
For steps 3041 to 3043 above, one possible rule for discarding audio frames is: preferentially discard audio frames whose VAD value is zero in the jitter buffer; if the VAD values of at least two audio frames in the jitter buffer are the same, preferentially discard the audio frame with the smaller energy value. That is, in the implementation of steps 3041 to 3043, audio frames with a VAD value of 0 are discarded first, and among frames with the same VAD value, the frame with the smaller energy value is discarded first, so that the audio frames carrying the least audio information are discarded as far as possible, thereby protecting sound quality.
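A minimal sketch of this discard rule, assuming each buffered frame carries the VAD value and energy value received from the sender; the dict keys 'vad' and 'energy' and the function name are illustrative.

```python
def pick_frame_to_drop(buffered_frames):
    """Choose which buffered frame to discard.

    buffered_frames: list of dicts with 'vad' (0 or 1) and 'energy' fields.
    Frames whose VAD value is 0 are discarded first; among frames with the
    same VAD value, the one with the smallest energy value is discarded.
    Only the selection is sketched; removal from the JB is omitted.
    """
    if not buffered_frames:
        return None
    return min(buffered_frames, key=lambda f: (f["vad"], f["energy"]))
```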
The method provided by the embodiment of the disclosure has at least the following beneficial effects:
the embodiment of the disclosure adaptively adjusts the audio processing mode according to the current capacity of the JB, so that even if audio frames accumulate heavily at the receiving end because of network jitter or an interruption, no large sound delay is produced. That is, the scheme shortens the duration of sound delay as much as possible, improves the communication experience of both communicating parties, and achieves a better audio processing effect.
In addition, the embodiment of the disclosure provides a multi-level speed-change scheme that executes different acceleration strategies according to the current actual water level of the JB, so that when audio frames accumulate severely in the JB because of network jitter or similar interruption events, the duration of sound delay can be shortened accurately on a case-by-case basis.
In addition, when discarding audio frames, the embodiment of the disclosure may preferentially discard audio frames with a VAD value of 0, and among frames with the same VAD value, preferentially discard the frame with the smaller energy value, so that the audio frames carrying the least audio information are discarded as far as possible, thereby protecting sound quality.
Fig. 6 is a block diagram of an audio processing device, according to an example embodiment. Referring to fig. 6, the apparatus includes a receiving module 601, an acquiring module 602, and a processing module 603.
The receiving module 601 is configured to receive audio frames sent by a sending end and buffer the currently received audio frame in a jitter buffer;
the acquisition module 602 is configured to acquire a first duration, a second duration and a third duration, wherein the first duration indicates the current network jitter, the second duration indicates the preset maximum network jitter to be handled, and the third duration is related to the maximum storage capacity of the jitter buffer;
the processing module 603 is configured to accelerate the processing of the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration;
wherein the current capacity indicates the amount of audio frame data currently buffered in the jitter buffer.
According to the apparatus provided by the embodiment of the disclosure, after receiving the audio frames sent by the sending end, the apparatus not only buffers the currently received audio frame in the jitter buffer but also obtains a first duration, a second duration and a third duration, wherein the first duration indicates the current network jitter, the second duration indicates the preset maximum network jitter to be handled, and the third duration is related to the maximum storage capacity of the jitter buffer; it then adaptively processes the audio frames buffered in the jitter buffer according to the current capacity of the jitter buffer, the first duration, the second duration and the third duration, wherein the current capacity indicates the amount of audio frame data currently buffered in the jitter buffer. As described above, the audio processing mode is adaptively adjusted according to the current capacity of the JB, so that even if audio frames accumulate heavily at the receiving end because of network jitter or an interruption, no large sound delay is produced. That is, the scheme shortens the duration of sound delay as much as possible, improves the communication experience of both communicating parties, and achieves a better audio processing effect.
In a possible implementation, the acquisition module is further configured to determine a probability distribution of the arrival time interval according to the arrival time interval of each received audio frame, and determine the first duration according to the probability distribution of the arrival time interval.
In a possible implementation, the processing module is further configured to, if the duration corresponding to the current capacity is greater than the first duration and less than the second duration, adopt a first processing mode to perform accelerated playback of the audio frames buffered in the jitter buffer;
wherein the first processing mode is to discard one audio frame every M1 audio frames, M1 being a positive integer.
In a possible implementation, the processing module is further configured to, if the duration corresponding to the current capacity is greater than the second duration and less than the third duration, adopt a second processing mode to perform accelerated playback of the audio frames buffered in the jitter buffer;
wherein the second processing mode is to discard one audio frame every M2 audio frames, M2 being a positive integer smaller than M1.
In a possible implementation, the processing module is further configured to, if the duration corresponding to the current capacity is not less than the third duration, adopt a third processing mode to perform accelerated playback of the audio frames buffered in the jitter buffer;
wherein the third processing mode is to discard part of the audio frames buffered in the jitter buffer until the amount of remaining audio frame data in the jitter buffer is reduced to the maximum storage capacity.
In a possible implementation, the receiving module is further configured to receive the VAD value and the energy value of the audio frame while receiving the audio frame sent by the sending end;
the rule for discarding the audio frames buffered in the jitter buffer is as follows:
preferentially discarding audio frames whose VAD value is zero in the jitter buffer;
if the VAD values of at least two audio frames in the jitter buffer are the same, preferentially discarding the audio frame with the smaller energy value.
Any combination of the above-mentioned alternative solutions may be used to form alternative embodiments of the present disclosure, which are not described herein.
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in the method embodiments and will not be elaborated here.
Fig. 7 shows a block diagram of an electronic device 700 according to an exemplary embodiment of the present application.
The device 700 may be a portable mobile terminal such as a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The device 700 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the apparatus 700 includes: a processor 701 and a memory 702.
Processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 701 may be implemented in at least one hardware form of DSP (Digital Signal Processing ), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array ). The processor 701 may also include a main processor, which is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit ); a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 701 may integrate a GPU (Graphics Processing Unit, image processor) for rendering and drawing of content required to be displayed by the display screen. In some embodiments, the processor 701 may also include an AI (Artificial Intelligence ) processor for processing computing operations related to machine learning.
Memory 702 may include one or more computer-readable storage media, which may be non-transitory. The memory 702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 702 is used to store at least one instruction for execution by processor 701 to implement the audio processing methods provided by the method embodiments herein.
In some embodiments, the apparatus 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 703 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 704, touch display 705, camera 706, audio circuitry 707, positioning component 708, and power supply 709.
A peripheral interface 703 may be used to connect I/O (Input/Output) related at least one peripheral device to the processor 701 and memory 702. In some embodiments, the processor 701, memory 702, and peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 704 is configured to receive and transmit RF (Radio Frequency) signals, also referred to as electromagnetic signals. The radio frequency circuitry 704 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 704 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuitry 704 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuitry 704 may also include NFC (Near Field Communication ) related circuitry, which is not limited in this application.
The display screen 705 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 705 is a touch display, the display 705 also has the ability to collect touch signals at or above the surface of the display 705. The touch signal may be input to the processor 701 as a control signal for processing. At this time, the display 705 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, the display 705 may be one, providing a front panel of the device 700; in other embodiments, the display 705 may be at least two, disposed on different surfaces of the device 700 or in a folded design; in still other embodiments, the display 705 may be a flexible display disposed on a curved surface or a folded surface of the device 700. Even more, the display 705 may be arranged in a non-rectangular irregular pattern, i.e. a shaped screen. The display 705 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 706 is used to capture images or video. Optionally, the camera assembly 706 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background-blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, Virtual Reality (VR) shooting, or other fused shooting functions. In some embodiments, camera assembly 706 may also include a flash. The flash can be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation under different color temperatures.
The audio circuit 707 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert the sound waves into electrical signals, and input the electrical signals to the processor 701 for processing, or to the radio frequency circuit 704 for voice communication. For stereo acquisition or noise reduction purposes, a plurality of microphones may be provided at different locations of the device 700. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert electrical signals not only into sound waves audible to humans, but also into sound waves inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 707 may also include a headphone jack.
The location component 708 is used to locate the current geographic location of the device 700 to enable navigation or LBS (Location Based Service). The positioning component 708 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
The power supply 709 is used to supply power to the various components in the device 700. The power supply 709 may use alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 709 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast-charging technology.
In some embodiments, the device 700 further includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyroscope sensor 712, pressure sensor 713, fingerprint sensor 714, optical sensor 715, and proximity sensor 716.
The acceleration sensor 711 can detect the magnitudes of the accelerations on the three coordinate axes of a coordinate system established for the device 700. For example, the acceleration sensor 711 may be used to detect the components of the gravitational acceleration along the three coordinate axes. The processor 701 may control the touch display 705 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 711. The acceleration sensor 711 may also be used to collect motion data for games or for the user.
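As a rough illustration of the landscape/portrait decision described above, the following Python sketch selects an orientation from the gravity components reported by an acceleration sensor. The axis convention and the function name are assumptions made for illustration only and are not part of this disclosure.

# Minimal sketch: choose a UI orientation from the gravity components (in g).
# Assumes x is the device's short axis and y its long axis (illustrative only).
def choose_orientation(gx: float, gy: float, gz: float) -> str:
    if abs(gy) >= abs(gx):
        return "portrait"    # gravity mostly along the long axis: device held upright
    return "landscape"       # gravity mostly along the short axis: device on its side

print(choose_orientation(0.02, 0.98, 0.10))  # -> portrait
print(choose_orientation(0.95, 0.05, 0.20))  # -> landscape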
The gyro sensor 712 may detect the body orientation and rotation angle of the device 700, and may cooperate with the acceleration sensor 711 to collect the user's 3D motion on the device 700. Based on the data collected by the gyro sensor 712, the processor 701 may implement the following functions: motion sensing (for example, changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 713 may be disposed on a side frame of the device 700 and/or in a lower layer of the touch display 705. When the pressure sensor 713 is disposed on a side frame of the device 700, it can detect the user's grip signal on the device 700, and the processor 701 performs left/right-hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed in the lower layer of the touch display 705, the processor 701 controls the operable controls on the UI according to the user's pressure operation on the touch display 705. The operable controls include at least one of a button control, a scroll-bar control, an icon control, and a menu control.
The fingerprint sensor 714 is used to collect the user's fingerprint, and the processor 701 identifies the user according to the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the user according to the collected fingerprint. Upon recognizing the user's identity as trusted, the processor 701 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 714 may be provided on the front, back, or side of the device 700. When a physical key or vendor logo is provided on the device 700, the fingerprint sensor 714 may be integrated with the physical key or vendor logo.
The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the touch display 705 based on the ambient light intensity collected by the optical sensor 715. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 705 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 705 is turned down. In another embodiment, the processor 701 may also dynamically adjust the shooting parameters of the camera assembly 706 based on the ambient light intensity collected by the optical sensor 715.
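The mapping from ambient light intensity to display brightness described above can be sketched as a simple clamp-and-scale function. The lux range and the 0-255 brightness scale below are illustrative assumptions, not values taken from this disclosure.

# Minimal sketch: map an ambient-light reading (lux) to a 0-255 brightness level.
# The 10-1000 lux working range is an assumption for illustration.
def brightness_from_lux(lux: float, lo: float = 10.0, hi: float = 1000.0) -> int:
    lux = min(max(lux, lo), hi)          # clamp to the assumed working range
    frac = (lux - lo) / (hi - lo)        # 0.0 in dim light, 1.0 in bright light
    return round(255 * frac)

print(brightness_from_lux(5))      # dim room     -> 0
print(brightness_from_lux(500))    # office light -> 126
print(brightness_from_lux(2000))   # sunlight     -> 255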
A proximity sensor 716, also referred to as a distance sensor, is typically provided on the front panel of the device 700. The proximity sensor 716 is used to capture the distance between the user and the front of the device 700. In one embodiment, when the proximity sensor 716 detects a gradual decrease in the distance between the user and the front face of the device 700, the processor 701 controls the touch display 705 to switch from the bright screen state to the off screen state; when the proximity sensor 716 detects that the distance between the user and the front face of the device 700 gradually increases, the processor 701 controls the touch display 705 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 7 is not limiting of the apparatus 700 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory comprising instructions executable by a processor of the electronic device 700 to perform the above-described audio processing method. Alternatively, the storage medium may be a non-transitory computer-readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, the instructions in which, when executed by a processor of the electronic device 700, enable the electronic device 700 to perform the audio processing method as in the method embodiments described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (6)

1. An audio processing method, comprising:
receiving an audio frame sent by a sending end, and buffering the currently received audio frame in a network jitter buffer;
acquiring a first duration, a second duration, and a third duration, wherein the first duration is used for indicating the current network jitter and is determined according to a probability distribution of arrival time intervals, the probability distribution of arrival time intervals being determined according to the arrival time intervals of all audio frames; the second duration is used for indicating a preset maximum network jitter that can be handled; and the third duration is related to the maximum storage capacity of the network jitter buffer;
if the duration corresponding to the current capacity of the network jitter buffer is greater than the first duration and less than the second duration, adopting a first processing mode to play the audio frames buffered in the network jitter buffer at an accelerated rate, wherein the first processing mode is to discard one audio frame every M1 audio frames, M1 being a positive integer;
if the duration corresponding to the current capacity of the network jitter buffer is greater than the second duration and less than the third duration, adopting a second processing mode to play the audio frames buffered in the network jitter buffer at an accelerated rate, wherein the second processing mode is to discard one audio frame every M2 audio frames, M2 being a positive integer and M2 being smaller than M1;
if the duration corresponding to the current capacity of the network jitter buffer is not less than the third duration, adopting a third processing mode to play the audio frames buffered in the network jitter buffer at an accelerated rate, wherein the third processing mode is to discard part of the audio frames buffered in the network jitter buffer until the amount of remaining audio frame data in the network jitter buffer is reduced to the maximum storage capacity;
wherein the current capacity is used to indicate the amount of audio frame data currently buffered in the network jitter buffer.
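For readability, the three accelerated-playback modes recited above can be summarized in a short Python sketch. The frame duration, the values of M1 and M2, the quantile used to estimate the first duration, and all function names are illustrative assumptions; the sketch is an informal reading of the claim, not the claimed implementation.

# Informal sketch of the three-mode decision in claim 1 (all constants assumed).
FRAME_MS = 20          # assumed duration of one buffered audio frame, in milliseconds
M1, M2 = 8, 4          # drop 1 frame per M1 (mode 1) or per M2 (mode 2) frames; M2 < M1

def estimate_first_duration_ms(arrival_intervals_ms, q=0.95):
    # One common reading of "determined from the probability distribution of
    # arrival time intervals": take a high quantile of the observed intervals.
    s = sorted(arrival_intervals_ms)
    return s[int(q * (len(s) - 1))]

def drop_every_n(frames, n):
    # Modes 1 and 2: discard one audio frame out of every n buffered frames.
    return [f for i, f in enumerate(frames, start=1) if i % n != 0]

def process_buffer(frames, first_ms, second_ms, third_ms, max_frames):
    buffered_ms = len(frames) * FRAME_MS          # duration of the current capacity
    if first_ms < buffered_ms < second_ms:        # mode 1: mild speed-up
        return drop_every_n(frames, M1)
    if second_ms < buffered_ms < third_ms:        # mode 2: stronger speed-up
        return drop_every_n(frames, M2)
    if buffered_ms >= third_ms:                   # mode 3: trim to the maximum capacity
        return frames[-max_frames:]               # simplification: keep only the newest frames
    return frames                                 # at or below first_ms: play normally

In this simplified sketch, mode 3 simply keeps the newest frames; claim 2 below refines which frames are preferentially discarded.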
2. The audio processing method according to claim 1, wherein the method further comprises, while receiving the audio frame sent by the sending end: receiving a voice activity detection (VAD) value and an energy value of the audio frame;
wherein the rule for discarding the audio frames buffered in the network jitter buffer is as follows:
preferentially discarding audio frames in the network jitter buffer whose VAD value is zero;
if at least two audio frames in the network jitter buffer have the same VAD value, preferentially discarding the audio frames with smaller energy values.
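The discard priority in claim 2 (silent frames first, then lower-energy frames among frames with the same VAD value) can be expressed as a sort key. The sketch below assumes a binary VAD value and hypothetical field names; it is an illustration of the rule, not the claimed implementation.

# Informal sketch of the discard priority in claim 2 (VAD assumed binary: 0 or 1).
from typing import NamedTuple

class Frame(NamedTuple):
    seq: int         # sequence number, used only to identify frames here
    vad: int         # voice activity detection value (0 = no speech detected)
    energy: float    # frame energy value

def frames_to_drop(frames, n_drop):
    # Sort so VAD == 0 frames come first; within the same VAD value, lower energy first.
    order = sorted(frames, key=lambda f: (f.vad, f.energy))
    return {f.seq for f in order[:n_drop]}

buf = [Frame(1, 1, 0.9), Frame(2, 0, 0.2), Frame(3, 1, 0.4), Frame(4, 0, 0.6)]
print(frames_to_drop(buf, 2))   # -> {2, 4}: the two VAD-0 frames are discarded first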
3. An audio processing apparatus, comprising:
a receiving module configured to receive the audio frame sent by the sending end and to buffer the currently received audio frame in a network jitter buffer;
an acquisition module configured to acquire a first duration, a second duration, and a third duration, wherein the first duration is used for indicating the current network jitter and is determined according to a probability distribution of arrival time intervals, the probability distribution of arrival time intervals being determined according to the arrival time intervals of all audio frames; the second duration is used for indicating a preset maximum network jitter that can be handled; and the third duration is related to the maximum storage capacity of the network jitter buffer;
a processing module configured to: if the duration corresponding to the current capacity of the network jitter buffer is greater than the first duration and less than the second duration, adopt a first processing mode to play the audio frames buffered in the network jitter buffer at an accelerated rate, wherein the first processing mode is to discard one audio frame every M1 audio frames, M1 being a positive integer; if the duration corresponding to the current capacity of the network jitter buffer is greater than the second duration and less than the third duration, adopt a second processing mode to play the audio frames buffered in the network jitter buffer at an accelerated rate, wherein the second processing mode is to discard one audio frame every M2 audio frames, M2 being a positive integer and M2 being smaller than M1; and if the duration corresponding to the current capacity of the network jitter buffer is not less than the third duration, adopt a third processing mode to play the audio frames buffered in the network jitter buffer at an accelerated rate, wherein the third processing mode is to discard part of the audio frames buffered in the network jitter buffer until the amount of remaining audio frame data in the network jitter buffer is reduced to the maximum storage capacity;
wherein the current capacity is used to indicate the amount of audio frame data currently buffered in the network jitter buffer.
4. The audio processing apparatus according to claim 3, wherein the receiving module is further configured to receive a voice activity detection (VAD) value and an energy value of the audio frame while receiving the audio frame sent by the sending end;
wherein the rule for discarding the audio frames buffered in the network jitter buffer is as follows:
preferentially discarding audio frames in the network jitter buffer whose VAD value is zero;
if at least two audio frames in the network jitter buffer have the same VAD value, preferentially discarding the audio frames with smaller energy values.
5. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio processing method of any of claims 1 to 2.
6. A storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the audio processing method of any one of claims 1 to 2.
CN202010372908.0A 2020-05-06 2020-05-06 Audio processing method, device, electronic equipment and storage medium Active CN111580777B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010372908.0A CN111580777B (en) 2020-05-06 2020-05-06 Audio processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111580777A CN111580777A (en) 2020-08-25
CN111580777B true CN111580777B (en) 2024-03-08

Family

ID=72115158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010372908.0A Active CN111580777B (en) 2020-05-06 2020-05-06 Audio processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111580777B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103594103A (en) * 2013-11-15 2014-02-19 腾讯科技(成都)有限公司 Audio processing method and related device
CN105245496A (en) * 2015-08-26 2016-01-13 广州市百果园网络科技有限公司 Audio data play method and device
CN105939289A (en) * 2015-12-21 2016-09-14 小米科技有限责任公司 Network jitter processing method, network jitter processing device and terminal equipment
CN109994127A (en) * 2019-04-16 2019-07-09 腾讯音乐娱乐科技(深圳)有限公司 Audio-frequency detection, device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105099949A (en) * 2014-04-16 2015-11-25 杜比实验室特许公司 Jitter buffer control based on monitoring for dynamic states of delay jitter and conversation

Also Published As

Publication number Publication date
CN111580777A (en) 2020-08-25

Similar Documents

Publication Publication Date Title
CN110890945B (en) Data transmission method, device, terminal and storage medium
CN111093108B (en) Sound and picture synchronization judgment method and device, terminal and computer readable storage medium
CN110769271B (en) Method and device for adjusting code rate of multi-path code stream
JP7361890B2 (en) Call methods, call devices, call systems, servers and computer programs
CN111462764B (en) Audio encoding method, apparatus, computer-readable storage medium and device
CN110390953B (en) Method, device, terminal and storage medium for detecting howling voice signal
US20220342631A1 (en) Method and system for playing audios
CN113744750B (en) Audio processing method and electronic equipment
CN111445901A (en) Audio data acquisition method and device, electronic equipment and storage medium
CN111813367A (en) Method, device and equipment for adjusting volume and storage medium
CN108924375A (en) Processing method, device, storage medium and the terminal of ringing volume
CN111613213A (en) Method, device, equipment and storage medium for audio classification
CN110166776B (en) Video encoding method, device and storage medium
CN107888975B (en) Video playing method, device and storage medium
CN113141541B (en) Code rate switching method, device, equipment and storage medium
CN110113669B (en) Method and device for acquiring video data, electronic equipment and storage medium
CN115273431B (en) Device retrieving method and device, storage medium and electronic device
CN111580777B (en) Audio processing method, device, electronic equipment and storage medium
US20220174356A1 (en) Method for determining bandwidth, terminal, and storage medium
CN111081283A (en) Music playing method and device, storage medium and terminal equipment
CN111586433B (en) Code rate adjusting method, device, equipment and storage medium
CN114071224B (en) Video data processing method, device, computer equipment and storage medium
CN113192519B (en) Audio encoding method and apparatus, and audio decoding method and apparatus
CN111083162B (en) Multimedia stream pause detection method and device
CN113162837B (en) Voice message processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant