CN114283837A - Audio processing method, device, equipment and storage medium


Info

Publication number
CN114283837A
Authority
CN
China
Prior art keywords
frame
audio
audio signal
audio signals
power spectrum
Prior art date
Legal status
Pending
Application number
CN202111053462.6A
Other languages
Chinese (zh)
Inventor
梁俊斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111053462.6A
Publication of CN114283837A

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the present application provide an audio processing method and apparatus, a device, and a storage medium, and relate to the technical field of artificial intelligence. Compared with a continuous encoding and transmission mode, the number of encoded and transmitted audio signals is reduced, which effectively reduces the transmission bandwidth of the audio signals. Secondly, the receiving end performs interpolation processing on the multi-frame sampled audio signals to obtain multi-frame interpolated audio signals, and applies each predicted audio compensation value to the corresponding interpolated audio signal so that the resulting target audio signal is as close to the original audio signal as possible, thereby ensuring call quality under interval frame extraction.

Description

Audio processing method, device, equipment and storage medium
Technical Field
Embodiments of the present invention relate to the technical field of artificial intelligence, and in particular to an audio processing method, apparatus, device, and storage medium.
Background
In voice communication applications, a sending end collects an analog audio signal through a microphone, converts it into a digital audio signal through an analog-to-digital conversion circuit, encodes the digital audio signal with an encoder, and finally sends the encoded audio signal to a receiving end through a communication network. Correspondingly, the receiving end decodes the received encoded audio signal with a decoder to recover the digital audio signal, which is finally played through a loudspeaker.
However, when the audio signal is encoded in the above manner, the amount of encoded audio data is large, so transmitting it to the receiving end occupies substantial transmission bandwidth. In scenarios with heavy voice-call traffic or limited bandwidth, call quality and operating costs are easily affected; how to effectively reduce the audio encoding and transmission bandwidth is therefore an urgent problem to be solved.
Disclosure of Invention
The embodiments of the present application provide an audio processing method, apparatus, device, and storage medium, which are used to effectively reduce the audio encoding and audio transmission bandwidth.
In one aspect, an embodiment of the present application provides an audio processing method, where the method includes:
receiving multi-frame sampled audio signals sent by a sending end, wherein the multi-frame sampled audio signals comprise first-type sampled audio signals and second-type sampled audio signals, the first-type sampled audio signals being extracted by the sending end from the multi-frame original audio signals at intervals of N frames, and the second-type sampled audio signals being obtained by the sending end from the multi-frame original audio signals based on audio features of the original audio signals, N being greater than 0;
performing interpolation processing on the multi-frame sampled audio signals to obtain corresponding multi-frame interpolated audio signals;
and predicting the audio compensation values corresponding to the multi-frame interpolated audio signals respectively, and performing audio compensation on the corresponding interpolated audio signals respectively with the obtained audio compensation values to obtain multi-frame target audio signals.
In one aspect, an embodiment of the present application provides an audio processing method, where the method includes:
extracting multi-frame sampled audio signals from multi-frame original audio signals, wherein the multi-frame sampled audio signals comprise first-type sampled audio signals and second-type sampled audio signals, the first-type sampled audio signals being extracted from the multi-frame original audio signals at intervals of N frames, and the second-type sampled audio signals being obtained from the multi-frame original audio signals based on audio features of the original audio signals, N being greater than 0;
sending the multi-frame sampled audio signals to a receiving end so that the receiving end performs interpolation processing on the multi-frame sampled audio signals to obtain corresponding multi-frame interpolated audio signals, predicts the audio compensation values corresponding to the multi-frame interpolated audio signals respectively, and performs audio compensation on the corresponding interpolated audio signals respectively with the obtained audio compensation values to obtain multi-frame target audio signals.
In one aspect, an embodiment of the present application provides an audio processing apparatus, where the apparatus includes:
a receiving module, configured to receive multi-frame sampled audio signals sent by a sending end, wherein the multi-frame sampled audio signals comprise first-type sampled audio signals and second-type sampled audio signals, the first-type sampled audio signals being extracted by the sending end from the multi-frame original audio signals at intervals of N frames, and the second-type sampled audio signals being obtained by the sending end from the multi-frame original audio signals based on audio features of the original audio signals, N being greater than 0;
an interpolation module, configured to perform interpolation processing on the multi-frame sampled audio signals to obtain corresponding multi-frame interpolated audio signals;
and a compensation module, configured to predict the audio compensation values corresponding to the multi-frame interpolated audio signals respectively, and perform audio compensation on the corresponding interpolated audio signals respectively with the obtained audio compensation values to obtain multi-frame target audio signals.
Optionally, the audio feature includes a pitch period, where the pitch period characterizes the duration of one cycle of the pitch corresponding to the audio signal;
aiming at the multi-frame original audio signals, the sending end respectively executes the following steps:
acquiring a first pitch period of a frame of original audio signal and a second pitch period of a previous frame of original audio signal of the frame of original audio signal;
and if the difference value of the first pitch period and the second pitch period is larger than a preset threshold value, taking the frame of original audio signal as a second type of sampled audio signal.
Optionally, the interpolation module is specifically configured to:
for the multi-frame sampled audio signal, respectively performing the following steps:
respectively carrying out Fourier transform on a frame of sampled audio signal and a next frame of sampled audio signal of the frame of sampled audio signal to obtain a first real power spectrum corresponding to the frame of sampled audio signal and a second real power spectrum corresponding to the next frame of sampled audio signal;
determining a predicted power spectrum corresponding to at least one frame of compensated audio signals between the one frame of sampled audio signals and the subsequent frame of sampled audio signals based on the first and second true power spectra;
and carrying out inverse Fourier transform on the real power spectrum corresponding to each of the multi-frame sampled audio signals and each obtained predicted power spectrum to obtain multi-frame interpolated audio signals.
Optionally, the interpolation module is specifically configured to:
and respectively executing the following steps aiming at the real power values of all frequency points in the first real power spectrum:
respectively determining the predicted power value of each frequency point of the at least one frame of compensation audio signal based on the real power value of one frequency point in the first real power spectrum and the real power value of one frequency point in the second real power spectrum;
and obtaining a prediction power spectrum corresponding to each of the at least one frame of compensation audio signal based on the prediction power value of each of the at least one frame of compensation audio signal at each frequency point.
Optionally, the interpolation module is specifically configured to:
for the at least one frame of compensated audio signal, respectively performing the following steps:
and determining the predicted power value of the frame of the compensated audio signal at the frequency point based on the real power value of the frequency point in the first real power spectrum, the real power value of the frequency point in the second real power spectrum and the distance between the frame of the compensated audio signal and the frame of the sampled audio signal.
Optionally, the audio compensation value is a power gain value;
the compensation module is specifically configured to:
acquiring target power spectrums corresponding to the multi-frame interpolation audio signals respectively;
and respectively predicting the power gain value of each frequency point in the target power spectrum corresponding to each multi-frame interpolation audio signal by adopting the trained deep learning network.
Optionally, a model training module is further included;
the model training module is specifically configured to:
acquiring sample data, wherein the sample data comprises a plurality of frames of original sample audio signals and a plurality of frames of interpolated sample audio signals, each frame of original sample audio signal corresponds to a first sample power spectrum, and each frame of interpolated sample audio signal corresponds to a second sample power spectrum; the power gain marking value of each frequency point in one second sample power spectrum is the ratio of the first sample power value of each frequency point in the corresponding first sample power spectrum to the second sample power value of the corresponding frequency point in the one second sample power spectrum;
and performing at least one iterative training on the deep learning network to be trained based on the sample data, and outputting the trained deep learning network, wherein in each iterative training, a target loss function for parameter adjustment is determined based on the power gain predicted value of each frequency point in the selected at least one second sample power spectrum and the power gain marking value of each frequency point in the at least one second sample power spectrum.
Optionally, the compensation module is specifically configured to:
respectively executing the following steps for the multi-frame interpolation audio signal:
performing audio compensation on the power value of the corresponding frequency point in the target power spectrum corresponding to the frame of interpolated audio signal by adopting the power gain value of each frequency point in the target power spectrum corresponding to the frame of interpolated audio signal to obtain a compensation power spectrum corresponding to the frame of interpolated audio signal;
performing inverse Fourier transform on the compensation power spectrum corresponding to the frame of interpolation audio signal to obtain a compensation audio signal corresponding to the frame of interpolation audio signal;
and obtaining multi-frame target audio signals based on the compensation audio signals corresponding to the multi-frame interpolation audio signals respectively.
Optionally, the compensation module is specifically configured to:
and multiplying the power gain value of each frequency point in the target power spectrum corresponding to the frame of interpolation audio signal by the power value of the corresponding frequency point in the target power spectrum corresponding to the frame of interpolation audio signal to obtain the compensation power spectrum corresponding to the frame of interpolation audio signal.
In one aspect, an embodiment of the present application provides an audio processing apparatus, where the apparatus includes:
a sampling module, configured to extract multi-frame sampled audio signals from multi-frame original audio signals, wherein the multi-frame sampled audio signals comprise first-type sampled audio signals and second-type sampled audio signals, the first-type sampled audio signals being extracted from the multi-frame original audio signals at intervals of N frames, and the second-type sampled audio signals being obtained from the multi-frame original audio signals based on audio features of the original audio signals, N being greater than 0;
a sending module, configured to send the multi-frame sampled audio signals to a receiving end so that the receiving end performs interpolation processing on the multi-frame sampled audio signals to obtain corresponding multi-frame interpolated audio signals, predicts the audio compensation values corresponding to the multi-frame interpolated audio signals respectively, and performs audio compensation on the corresponding interpolated audio signals respectively with the obtained audio compensation values to obtain multi-frame target audio signals.
Optionally, the audio feature includes a pitch period, where the pitch period characterizes the duration of one cycle of the pitch corresponding to the audio signal;
the sampling module is specifically configured to:
respectively executing the following steps for the multiple frames of original audio signals:
acquiring a first pitch period of a frame of original audio signal and a second pitch period of a previous frame of original audio signal of the frame of original audio signal;
and if the difference value of the first pitch period and the second pitch period is larger than a preset threshold value, taking the frame of original audio signal as a second type of sampled audio signal.
In one aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the audio processing method when executing the program.
In one aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program executable by a computer device, and when the program runs on the computer device, the computer device is caused to execute the steps of the audio processing method.
In the embodiments of the present application, the sending end extracts one frame of sampled audio signal every N frames of original audio signals, and also extracts, as sampled audio signals, the original audio signals whose audio features meet the feature change condition, so that the features of the original audio are retained while interval frame extraction is realized. Compared with a continuous encoding and transmission mode, the number of encoded and transmitted audio signals is greatly reduced, which effectively reduces the transmission bandwidth of the audio signals. Secondly, the receiving end interpolates and compensates the received frame-extracted audio signals and restores them into complete multi-frame target audio signals, thereby ensuring call quality under interval frame extraction.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without inventive effort.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an audio processing method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a chat interface of an instant messaging application according to an embodiment of the present application;
fig. 4 is a schematic diagram of a voice call interface of an instant messaging application according to an embodiment of the present application;
fig. 5 is a schematic flowchart of an audio processing method according to an embodiment of the present application;
fig. 6 is a schematic diagram of a power spectrum of an original audio signal according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a power spectrum of a sampled audio signal provided by an embodiment of the present application;
FIG. 8a is a schematic diagram of a power spectrum of an interpolated audio signal according to an embodiment of the present application;
fig. 8b is a schematic diagram of a power spectrum of a target audio signal obtained by compensation according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a deep learning network according to an embodiment of the present disclosure;
fig. 10 is a schematic flowchart of a training method for a deep learning network according to an embodiment of the present disclosure;
fig. 11 is a flowchart illustrating an audio compensation method according to an embodiment of the present application;
fig. 12 is a flowchart illustrating an audio processing method according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clearly apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
For convenience of understanding, terms referred to in the embodiments of the present invention are explained below.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning. The solution provided by the embodiments of the present application relates to the machine learning technology of artificial intelligence.
Cloud conference: a cloud conference is an efficient, convenient, and low-cost conference form based on cloud computing technology. Through a simple, easy-to-use internet interface, a user can quickly and efficiently share voice, data files, and video with teams and clients all over the world, while complex technologies such as data transmission and processing in the conference are handled by the cloud conference service provider. At present, domestic cloud conferences mainly focus on service content in the Software as a Service (SaaS) mode, including service forms such as telephone, network, and video; a video conference based on cloud computing is called a cloud conference. The audio processing method in the embodiments of the present application is applicable to cloud conferences.
Voice encoding and decoding: in voice communication applications, the sending end collects an analog audio signal through a microphone, converts it into a digital audio signal through an analog-to-digital conversion circuit, encodes the digital audio signal with an encoder, and finally packs the encoded audio signal according to the transmission format and protocol of the communication network and sends it to the receiving end. Correspondingly, after receiving the data packets, the receiving end unpacks the encoded audio signal, decodes it with a decoder to recover the digital audio signal, and finally plays the digital audio signal through a loudspeaker.
The following is a description of the design concept of the embodiments of the present application.
At present, voice codec technology is widely applied in scenarios such as voice call applications and video call applications. However, when the audio signal is encoded in the related art, the amount of encoded audio data is large, so transmitting it to the receiving end occupies substantial transmission bandwidth. In scenarios with heavy voice-call traffic or limited bandwidth, call quality and operating costs are easily affected; how to effectively reduce the voice encoding and transmission bandwidth is therefore an urgent problem to be solved.
It has been found through analysis that the related art generally employs a continuous encoding method when encoding an audio signal. For example, after continuously acquiring multiple frames of original audio signals, the sending end encodes each frame of original audio signal, and then sends each frame of encoded original audio signal to the receiving end, that is, each frame of original audio signal needs to consume transmission bandwidth. However, some original audio signals have similarities and correlations, and if a part of the original audio signals is selected from the similar original audio signals to be encoded and sent to the receiving end, the receiving end can restore all the original audio signals based on the part of the original audio signals, so that the voice encoding and transmission bandwidth of the audio signals can be reduced, and the quality of the call service is ensured.
In view of this, an embodiment of the present application provides an audio processing method, in which a sending end extracts a plurality of sampled audio signals from a plurality of original audio signals, where the plurality of sampled audio signals include a first type of sampled audio signal and a second type of sampled audio signal, and the first type of sampled audio signal is obtained from the plurality of original audio signals every N frames; a second type of sampled audio signal is derived from a plurality of frames of the original audio signal based on audio characteristics of the original audio signal, N being greater than 0.
The sending end sends the multi-frame sampling audio signals to the receiving end, and the receiving end carries out interpolation processing on the multi-frame sampling audio signals to obtain corresponding multi-frame interpolation audio signals. And then predicting audio compensation values corresponding to the multi-frame interpolation audio signals respectively, and performing audio compensation on the corresponding interpolation audio signals respectively by adopting the obtained audio compensation values to obtain multi-frame target audio signals.
In the embodiments of the present application, the sending end extracts one frame of sampled audio signal every N frames of original audio signals, and also extracts, as sampled audio signals, the original audio signals whose audio features meet the feature change condition, so that the features of the original audio are retained while interval frame extraction is realized. Compared with a continuous encoding and transmission mode, the number of encoded and transmitted audio signals is greatly reduced, which effectively reduces the transmission bandwidth of the audio signals. Secondly, the receiving end interpolates and compensates the received frame-extracted audio signals and restores them into complete multi-frame target audio signals, thereby ensuring call quality under interval frame extraction.
Referring to fig. 1, which shows a system architecture to which the audio processing method provided in the embodiments of the present application is applicable, the architecture includes at least a sending end 101, a server 103, and a receiving end 102.
The sending end 101 and the receiving end 102 may have target applications with call functions installed, such as a voice call application, a video call application, an instant messaging application, a live broadcast application, or a vehicle-mounted application. The target application may take the form of a client application, a web application, an applet, or the like. The sending end 101 and the receiving end 102 may be, but are not limited to, smartphones, tablet computers, notebook computers, desktop computers, smart speakers, smart watches, smart in-vehicle systems, automatic driving systems, assisted driving systems, and the like.
The server 103 may be a background server of the target application and provides corresponding services for the target application. The server 103 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms. The sending end 101 and the server 103 may be directly or indirectly connected through wired or wireless communication, and the receiving end 102 and the server 103 may be directly or indirectly connected through wired or wireless communication, which is not limited here.
In the process of a call, as shown in fig. 2, a sending end collects continuous multiple frames of original audio signals, and extracts multiple frames of sampled audio signals from the multiple frames of original audio signals, where the multiple frames of sampled audio signals include a first type of sampled audio signal and a second type of sampled audio signal. The first type of sampled audio signal is obtained from a plurality of original audio signals every N frames, and the second type of sampled audio signal is obtained from a plurality of original audio signals based on audio characteristics of the original audio signals, where N is greater than 0.
And the sending terminal performs voice coding and channel coding on the obtained multi-frame sampled audio signals and then sends the coded multi-frame sampled audio signals to the server through the communication network. And the server transmits the encoded multi-frame sampling audio signal to a receiving end.
The receiving end respectively carries out channel decoding and voice decoding on the encoded multi-frame sampled audio signals, and then carries out interpolation processing on the decoded multi-frame sampled audio signals to obtain corresponding multi-frame interpolated audio signals. And then predicting audio compensation values corresponding to the multi-frame interpolation audio signals, performing audio compensation on the corresponding interpolation audio signals by adopting the obtained audio compensation values to obtain multi-frame target audio signals, and finally playing the obtained target audio signals.
In practical applications, the technical solution provided by the embodiments of the present application is applicable to all one-to-one or multi-party call service scenarios, such as voice call, video call, live broadcast, and broadcast scenarios.
For example, both the sending end and the receiving end are pre-installed with an instant messaging application, and the instant messaging application has both a voice call function and a video call function.
As shown in fig. 3, the sending end displays a chat interface of the instant messaging application, which includes a picture sharing icon 301, a picture taking icon 302, a voice call icon 303, and a location sharing icon 304. In response to user A clicking the voice call icon 303, the sending end establishes a voice call connection with the receiving end and displays a voice call interface of the instant messaging application. As shown in fig. 4, the voice call interface includes the avatars of user A and user B and three call function options: hang up, mute, and hands-free. When user A and user B conduct a voice call on the voice call interface, the audio processing procedures of the sending end and the receiving end are as described with reference to fig. 2 and are not repeated here.
For example, the sending end is an intelligent vehicle-mounted system, and a vehicle-mounted application is pre-installed on the sending end, and the vehicle-mounted application has a voice call function and a video call function. The receiving end is a smart phone and is provided with a video call application. The intelligent vehicle-mounted system responds to the video call operation triggered by the user A in the vehicle-mounted application, establishes video call connection with the intelligent mobile phone and displays a video call interface of the vehicle-mounted application. The smart phone responds to the answering operation triggered by the user B in the video call application and displays a video call interface of the video call application. When the user a and the user B perform a video call, the audio processing process of the corresponding smart car-mounted system and the smart phone is described with reference to fig. 2, which is not described herein again.
Based on the system architecture diagram shown in fig. 1, an embodiment of the present application provides a flow of an audio processing method. As shown in fig. 5, the flow of the method may be executed interactively by the sending end 101 and the receiving end 102 shown in fig. 1, and includes the following steps:
in step S501, the transmitting end extracts multiple frames of sampled audio signals from multiple frames of original audio signals.
Specifically, the sending end collects a continuous audio signal through a microphone and divides it into multiple frames of original audio signals by windowing or segmentation. The frame length of each frame of the original audio signal can be set according to the actual situation, for example, 20 ms.
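For illustration, a minimal framing sketch is given below; the 16 kHz sample rate, 20 ms frame length, absence of overlap, and function name are assumptions for the example, since the patent does not fix these values:

```python
import numpy as np

def split_into_frames(signal: np.ndarray, sample_rate: int = 16000,
                      frame_ms: int = 20) -> np.ndarray:
    """Segment a 1-D digital audio signal into fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000   # 320 samples per 20 ms frame
    n_frames = len(signal) // frame_len          # trailing partial frame is dropped
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)
```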
Before sampling a plurality of frames of original audio signals, echo cancellation and denoising can be performed on the original audio signals to improve the quality of the original audio signals. The multi-frame sampled audio signal comprises a first type of sampled audio signal and a second type of sampled audio signal. The first type of sampled audio signal is obtained from a plurality of original audio signals every N frames, and the second type of sampled audio signal is obtained from a plurality of original audio signals based on audio characteristics of the original audio signals, where N is greater than 0.
The larger the value of N, the lower the audio encoding rate and transmission bandwidth, but the higher the requirement on the restoration capability of the receiving end. Therefore, a suitable value of N may be selected according to the actual call effect and the call service requirements, and different values may of course be set for different call services. An original audio signal whose audio features meet the feature change condition may also be called an audio-feature mutation frame. Mutation frames are also extracted when sampled audio signals are extracted at intervals, so that the receiving end can restore the complete audio signal and the call quality is ensured.
When the same frame appears in both the first-type and second-type sampled audio signals, the duplicates may be removed. In addition, when sampling the multi-frame original audio signals, a Fourier transform may be performed on the original audio signals to obtain their power spectra in the frequency domain, and the multi-frame original audio signals may then be sampled based on the power spectra corresponding to them respectively. Of course, the multi-frame original audio signals may also be sampled directly in the time domain, which is not limited in this application.
For example, suppose the power spectra of 7 frames of original audio signals in the frequency domain are as shown in fig. 6, where the abscissa represents the frame number (the 1st frame, the 2nd frame, …, the 7th frame) and the ordinate represents the frequency points (frequency point 1, frequency point 2, …, frequency point 10). Each square indicates the power value of one frame of the original audio signal at one frequency point; for example, square 601 indicates the power value of the 7th frame of the original audio signal at frequency point 10.
The power spectrum of the 1st frame of the original audio signal is extracted first; then, at an interval of 5 frames, the power spectrum of the 7th frame is extracted. The 3rd frame is an audio-feature mutation frame, so its power spectrum is also extracted. The sampling result is shown in fig. 7: the 1st, 3rd, and 7th frames of the original audio signal are retained as sampled audio signals, where the 1st and 7th frames are first-type sampled audio signals and the 3rd frame is a second-type sampled audio signal.
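A minimal sketch of this interval-plus-mutation frame selection follows; the 0-based frame indices, precomputed per-frame mutation flags, and function name are illustrative assumptions. With N = 5 and a mutation flag on frame index 2, it reproduces the fig. 6/7 example:

```python
from typing import List, Sequence

def select_sampled_frames(num_frames: int, n_interval: int,
                          mutation_flags: Sequence[int]) -> List[int]:
    """Return indices of frames to keep: frame 0 and every (n_interval + 1)-th
    frame thereafter (first type), plus every frame whose mutation flag is set
    (second type); duplicates are merged."""
    first_type = set(range(0, num_frames, n_interval + 1))
    second_type = {i for i in range(num_frames) if mutation_flags[i]}
    return sorted(first_type | second_type)

# Example matching fig. 6/7: 7 frames, N = 5, frame index 2 is a mutation frame.
print(select_sampled_frames(7, 5, [0, 0, 1, 0, 0, 0, 0]))  # [0, 2, 6]
```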
Step S502, the sending end sends the multi-frame sampling audio signal to the receiving end.
Specifically, the sending end may perform automatic gain control on the obtained multi-frame sampled audio signals to adjust them to an appropriate volume before encoding, or may directly encode the obtained multi-frame sampled audio signals. Since the multi-frame sampled audio signals in the present application are not a continuous audio signal, they may be encoded with an independent-frame coding scheme, such as iLBC.
The sending end can send the multi-frame sampling audio signal to the receiving end through the one-stage or multi-stage server, and can also directly send the multi-frame sampling audio signal to the receiving end.
Step S503, the receiving end carries out interpolation processing on the multi-frame sampling audio signal to obtain a corresponding multi-frame interpolation audio signal.
Specifically, a Fourier transform may be performed on the sampled audio signals to obtain their power spectra in the frequency domain, and the multi-frame sampled audio signals may then be interpolated based on their corresponding power spectra. Of course, the multi-frame sampled audio signals may also be interpolated directly in the time domain, which is not limited in this application.
The number of interpolated audio signals obtained after the interpolation processing may be the same as or less than the number of original audio signals collected by the sending end.
Step S504, the receiving end predicts respective corresponding audio compensation values of the multi-frame interpolation audio signals, and performs audio compensation on the respective interpolation audio signals by using the respective obtained audio compensation values to obtain multi-frame target audio signals.
Specifically, the receiving end may use a deep learning Network to predict audio compensation values corresponding to each of the multiple frames of interpolated audio signals, where the deep learning Network includes, but is not limited to, a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN), and a Long Short-Term Memory Network (LSTM).
A power spectrum of the interpolated audio signal in the frequency domain may be acquired, and audio compensation may then be performed on the interpolated audio signal based on its corresponding power spectrum; in this case the audio compensation value is a power gain value. Alternatively, the interpolated audio signal may be compensated directly in the time domain to obtain the compensated target audio signal. The compensated target audio signal may then be played through a loudspeaker.
In the embodiments of the present application, the sending end extracts one frame of sampled audio signal every N frames of original audio signals, and also extracts, as sampled audio signals, the original audio signals whose audio features meet the feature change condition, so that the features of the original audio are retained while interval frame extraction is realized. Compared with a continuous encoding and transmission mode, the number of encoded and transmitted audio signals is greatly reduced, which effectively reduces the transmission bandwidth of the audio signals. Secondly, the receiving end interpolates and compensates the received frame-extracted audio signals and restores them into complete multi-frame target audio signals, thereby ensuring call quality under interval frame extraction.
Optionally, in step S501, the audio feature includes a pitch period, where the pitch period is used to characterize a time wavelength of a pitch corresponding to the audio signal.
When a sending end extracts a second type of sampling audio signal from a plurality of frames of original audio signals based on the audio characteristics of the original audio signals, aiming at the plurality of frames of original audio signals, the following steps are respectively executed:
a first pitch period of a frame of an original audio signal and a second pitch period of a previous frame of the original audio signal are obtained. And if the difference value of the first pitch period and the second pitch period is larger than a preset threshold value, taking the frame of original audio signal as a second type of sampled audio signal.
Specifically, the pitch period can be extracted during the encoding process of the encoder, or detected with a period detection algorithm, such as autocorrelation-based pitch detection, cepstrum-based pitch detection, or pitch detection based on linear predictive coding.
The preset threshold is an empirical value related to the sampling rate of the audio signal; the higher the sampling rate, the larger the preset threshold. When the difference between the first pitch period and the second pitch period is larger than the preset threshold, the mutation flag of the original audio signal is set to 1, indicating that it is an audio-feature mutation frame, and the original audio signal is extracted as a second-type sampled audio signal. When the difference is less than or equal to the preset threshold, the mutation flag is set to 0, indicating a non-mutation frame, and the original audio signal is not extracted.
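A sketch of this pitch-period check is given below; treating the "difference" as an absolute difference is an assumption, as are the function name and the list-based interface:

```python
from typing import List, Sequence

def mark_mutation_frames(pitch_periods: Sequence[float],
                         threshold: float) -> List[int]:
    """Set flag = 1 for each frame whose pitch period differs from the previous
    frame's by more than the preset threshold, else 0. Frame 0 has no previous
    frame, so its flag stays 0."""
    flags = [0] * len(pitch_periods)
    for k in range(1, len(pitch_periods)):
        if abs(pitch_periods[k] - pitch_periods[k - 1]) > threshold:
            flags[k] = 1
    return flags
```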
In the embodiment of the application, the pitch period is adopted to represent the audio characteristics of the original audio signal, and when the variation degree of the pitch period of the original audio signal is greater than the preset threshold value, the original audio signal is used as a sampling audio signal, so that the acquired sampling audio signal can represent the acquired original audio signal more comprehensively, and further, the receiving end can restore a more complete target audio signal based on the sampling audio signal, and the effect of audio conversation is improved.
Optionally, in step S503 above, an embodiment of the present application provides an implementation manner of performing interpolation processing on a multi-frame sampled audio signal in a frequency domain, which is specifically as follows:
for a plurality of frames of sampled audio signals, respectively performing the following steps: and respectively carrying out Fourier transform on a frame of sampled audio signal and a frame of sampled audio signal after the frame of sampled audio signal to obtain a first real power spectrum corresponding to the frame of sampled audio signal and a second real power spectrum corresponding to the frame of sampled audio signal after the frame of sampled audio signal. A predicted power spectrum corresponding to at least one frame of compensated audio signal between the frame of sampled audio signal and a subsequent frame of sampled audio signal is then determined based on the first and second true power spectra. And performing inverse Fourier transform on the real power spectrum corresponding to each sampled audio signal of the multiple frames and each obtained predicted power spectrum to obtain the multi-frame interpolation audio signal.
Specifically, the power spectrum in the embodiments of the present application may be a linear power spectrum, a logarithmic power spectrum, a Bark-domain power spectrum, or the like. The first real power spectrum contains the power values of one frame of the sampled audio signal at multiple frequency points, and the second real power spectrum contains the power values of the next frame of the sampled audio signal at multiple frequency points. The distance between two frames of sampled audio signals is the difference of their frame numbers minus one; it also represents the number of original audio signals between the two sampled frames that were not extracted during sampling. When frame interpolation is performed on the sampled audio signals, the number of compensated audio signals interpolated between two sampled frames is less than or equal to the distance between them. Both the sampled audio signals and the obtained compensated audio signals are taken as interpolated audio signals.
Optionally, when predicting the predicted power spectrum corresponding to the compensated audio signal, for the real power values of the frequency points in the first real power spectrum, the following steps are respectively performed:
and respectively determining the predicted power value of each frequency point of at least one frame of compensation audio signal based on the real power value of one frequency point in the first real power spectrum and the real power value of the frequency point in the second real power spectrum, and then obtaining the predicted power spectrum corresponding to each frame of compensation audio signal based on the predicted power value of each frequency point of each frame of compensation audio signal.
Specifically, for each compensated audio signal, the predicted power value of the compensated audio signal at a frequency point is determined based on the known real power values of the sampled audio signals at that frequency point, and the predicted power spectrum corresponding to the compensated audio signal is then obtained from its predicted power values at all frequency points.
Since the distances between each of the compensated audio signals and the sampled audio signal are different, and the correlation between the compensated audio signal and the sampled audio signal is also different when the distances are different, the distances between the compensated audio signal and the sampled audio signal need to be considered when predicting the predicted power spectrum corresponding to the compensated audio signal.
Specifically, for at least one frame of compensated audio signal, the following steps are respectively performed:
and determining the predicted power value of the frame of the compensated audio signal at the frequency point based on the real power value of the frequency point in the first real power spectrum, the real power value of the frequency point in the second real power spectrum and the distance between the frame of the compensated audio signal and the sampled audio signal. Then, based on the predicted power values of the frame compensation audio signal at all frequency points, a predicted power spectrum corresponding to the compensation audio signal is obtained, and the method can be specifically expressed as the following formula (1):
X(i, j+n) = Y(i, j) * (m - n)/m + Y(i, j+1) * n/m    (1)
wherein X(i, j+n) represents the predicted power value, at the ith frequency point, of the nth frame of compensated audio signal after the jth frame of sampled audio signal; Y(i, j) represents the real power value of the jth frame of sampled audio signal at the ith frequency point; Y(i, j+1) represents the real power value of the (j+1)th frame of sampled audio signal at the ith frequency point; d represents the distance between the jth and (j+1)th frames of sampled audio signals; i is greater than 0, j is greater than 0, 0 < n ≤ d, and m = d + 1.
After the predicted power spectrum corresponding to at least one frame of compensation audio signal between two frames of sampling audio signals is obtained, the power spectrum of the multi-frame interpolation audio signal is obtained based on the real power spectrum corresponding to each of the two frames of sampling audio signals and the predicted power spectrum corresponding to at least one frame of compensation audio signal.
And then performing inverse Fourier transform on the real power spectrum corresponding to each of the multi-frame sampled audio signals and each of the obtained predicted power spectrums to obtain multi-frame interpolated audio signals. And then predicting audio compensation values corresponding to the multi-frame interpolation audio signals in the time domain, and performing audio compensation on the corresponding interpolation audio signals by adopting the obtained audio compensation values to obtain multi-frame target audio signals. Or predicting respective corresponding audio compensation values of the multi-frame interpolation audio signals in the frequency domain based on the power spectrum of the multi-frame interpolation audio signals, and performing audio compensation on the corresponding interpolation audio signals by adopting the respective obtained audio compensation values to obtain multi-frame target audio signals.
Illustratively, as shown in fig. 8a, suppose the 1st, 3rd, and 7th frames of the original audio signals are extracted from the 7 frames of original audio signals as sampled audio signals.
The distance between the 1st and 3rd frames is 1, so the predicted power value at each frequency point of the 1st compensated audio signal after the 1st frame (i.e., compensated audio signal 1) can be determined with formula (1), and the predicted power spectrum corresponding to compensated audio signal 1 is then determined from the obtained predicted power values.
The distance between the 3rd and 7th frames is 3, so the predicted power value at each frequency point of the 1st compensated audio signal after the 3rd frame (i.e., compensated audio signal 2) can be determined with formula (1), and the predicted power spectrum corresponding to compensated audio signal 2 is then determined from the obtained predicted power values. Similarly, the predicted power spectra corresponding to the 2nd and 3rd compensated audio signals after the 3rd frame (i.e., compensated audio signals 3 and 4) can be determined in the same manner. After interpolation, the power spectra corresponding to each of the 7 frames of interpolated audio signals are obtained.
Based on the power spectrum corresponding to each of the 7 frames of interpolated audio signals shown in fig. 8a, the audio compensation values corresponding to each of the 7 frames of interpolated audio signals in the frequency domain are predicted, and the obtained audio compensation values are used to perform audio compensation on the corresponding interpolated audio signals, respectively, where the corresponding power spectrum of the 7 frames of target audio signals obtained by the audio compensation is shown in fig. 8 b.
In the embodiment of the application, the predicted power spectrum corresponding to the compensation audio signal between the sampled audio signals is predicted based on the known real power spectrum corresponding to the sampled audio signals, and then the complete audio is restored based on the real power spectrum corresponding to the sampled audio signals and the predicted power spectrum corresponding to the compensation audio signals, so that the better speech intelligibility is kept under the lower encoding rate.
Optionally, in step S504, in order to improve the accuracy of audio compensation, after the target power spectrums corresponding to the multiple frames of interpolated audio signals are obtained, a trained deep learning network may be used to respectively predict the power gain values of each frequency point in the target power spectrums corresponding to the multiple frames of interpolated audio signals.
The technical solution of the embodiments of the present application is described below, taking one possible deep learning network as an example. Referring to fig. 9, which shows a schematic structural diagram of a deep learning network provided in the embodiments of the present application, the deep learning network may include an input DENSE (fully connected) unit, two GRU (Gated Recurrent Unit) network layers, and an output DENSE unit.
The target power spectra corresponding to N frames of interpolated audio signals are fed into the deep learning network through the input DENSE unit, and features are extracted from the N target power spectra by the two GRU layers to obtain N interpolated audio features. Based on the N interpolated audio features, the output DENSE unit predicts and outputs the power gain values of each frequency point in the target power spectrum corresponding to each of the N frames of interpolated audio signals.
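A minimal sketch of the described DENSE, two-layer GRU, DENSE topology follows, assuming PyTorch; the layer sizes, activation choices, and sigmoid output range are illustrative assumptions, since the patent fixes neither the framework nor the dimensions:

```python
import torch
import torch.nn as nn

class GainPredictor(nn.Module):
    """Input DENSE unit -> two GRU network layers -> output DENSE unit."""

    def __init__(self, n_bins: int = 257, hidden: int = 128):
        super().__init__()
        self.dense_in = nn.Linear(n_bins, hidden)        # input DENSE unit
        self.gru = nn.GRU(hidden, hidden, num_layers=2,
                          batch_first=True)              # two GRU layers
        self.dense_out = nn.Linear(hidden, n_bins)       # output DENSE unit

    def forward(self, power_spectra: torch.Tensor) -> torch.Tensor:
        # power_spectra: (batch, n_frames, n_bins) target power spectra of the
        # interpolated audio signals; output: per-frequency-point gain values.
        x = torch.relu(self.dense_in(power_spectra))
        x, _ = self.gru(x)
        return torch.sigmoid(self.dense_out(x))  # gains in (0, 1): an assumption
```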
Before the deep learning network is put into use, the deep learning network needs to be trained first, so the training process of the deep learning network is described below. Please refer to fig. 10, which is a schematic diagram of a training process of the deep learning network.
Step S1001: and acquiring sample data.
Specifically, the sample data includes multiple frames of original sample audio signals and multiple frames of interpolated sample audio signals, where each frame of the original sample audio signal corresponds to a first sample power spectrum and each frame of the interpolated sample audio signal corresponds to a second sample power spectrum. The power gain marking value of each frequency point in a second sample power spectrum is the ratio of the first sample power value of that frequency point in the corresponding first sample power spectrum to the second sample power value of the corresponding frequency point in that second sample power spectrum.
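A sketch of this marking-value computation is given below; the epsilon guard against zero power and the function name are implementation assumptions:

```python
import numpy as np

def power_gain_labels(first_sample_power: np.ndarray,
                      second_sample_power: np.ndarray,
                      eps: float = 1e-12) -> np.ndarray:
    """Per-frequency-point power gain marking value: the ratio of the first
    sample power value to the second sample power value at the same point."""
    return first_sample_power / (second_sample_power + eps)
```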
Step S1002: and determining the power gain predicted value of each frequency point in the selected at least one second sample power spectrum by using the deep learning network to be trained.
Step S1003: and determining a target loss function for parameter adjustment based on the power gain prediction value of each frequency point in the selected at least one second sample power spectrum and the power gain marking value of each frequency point in the at least one second sample power spectrum.
In particular, a negative-log cross-entropy may be employed as the target loss function of the deep learning network.
Generally speaking, the closer the predicted power gain value is to the power gain marking value, for example, a marking value of 1 with a predicted value of 0.95, or a marking value of 0 with a predicted value of 0.02, the smaller the loss value of the deep learning network obtained with the target loss function, indicating that the predicted power gain values are closer to the true values and the accuracy is therefore higher.
Step S1004: and determining whether the deep learning network converges according to the target loss function.
Step S1005: and when the deep learning network is determined not to be converged, adjusting the model parameters of the deep learning network according to the target loss function.
Step S1006: and when the deep learning network is determined to be converged, finishing the training and outputting the trained deep learning network.
In the embodiment of the present application, when the loss value is less than a set loss threshold, the accuracy of the deep learning network meets the requirement, so the network is determined to have converged. Conversely, when the loss value is not less than the set loss threshold, the accuracy does not yet meet the requirement, so the parameters of the network are adjusted further and the training process is repeated on the adjusted network, that is, steps S1002 to S1004 are performed again.
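Steps S1002 to S1006 can be pictured with the following training-loop sketch; it assumes the GainPredictor sketch above, and it substitutes a mean-squared-error loss for the embodiment's cross-entropy-style target loss function purely for brevity.

```python
import torch

def train(model, loader, loss_threshold=1e-3, max_epochs=100):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()  # stand-in for the target loss function
    for _ in range(max_epochs):
        for spectra, marking_values in loader:  # second sample power spectra + labels
            predicted_gains = model(spectra)                 # S1002: predict power gains
            loss = loss_fn(predicted_gains, marking_values)  # S1003: target loss
            opt.zero_grad()
            loss.backward()                                  # S1005: adjust parameters
            opt.step()
        if loss.item() < loss_threshold:                     # S1004/S1006: convergence
            break
    return model
```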
In the embodiment of the application, after the trained deep learning network is obtained, it can be used to predict the power gain value of each frequency point in the target power spectrum corresponding to each frame of interpolated audio signal. Then, in one possible implementation, audio compensation is performed on the corresponding interpolated audio signals using the obtained audio compensation values, to obtain multiple frames of target audio signals:
specifically, for a plurality of frames of interpolated audio signals, the following steps are respectively performed:
and performing audio compensation on the power value of the corresponding frequency point in the target power spectrum corresponding to the frame of interpolated audio signal by adopting the power gain value of each frequency point in the target power spectrum corresponding to the frame of interpolated audio signal to obtain a compensation power spectrum corresponding to the frame of interpolated audio signal. And then performing inverse Fourier transform on the compensation power spectrum corresponding to the frame of interpolation audio signal to obtain a compensation audio signal corresponding to the frame of interpolation audio signal. And then obtaining multi-frame target audio signals based on the compensation audio signals corresponding to the multi-frame interpolation audio signals respectively.
In specific implementation, the power gain value of each frequency point in the target power spectrum corresponding to a frame of interpolated audio signal may be multiplied by the power value of the corresponding frequency point in that target power spectrum to obtain the compensation power spectrum corresponding to the frame of interpolated audio signal. Alternatively, a weight may be set for each frequency point in the target power spectrum; the power gain value of each frequency point is multiplied by its weight to obtain a weighted power gain value, and the weighted power gain value is then multiplied by the power value of the corresponding frequency point to obtain the compensation power spectrum corresponding to the frame of interpolated audio signal.
Further, an inverse Fourier transform is performed on the compensation power spectrum corresponding to the interpolated audio signal to obtain the compensated audio signal corresponding to that interpolated audio signal, where the phase value of the compensated audio signal is the phase value of the most recently received sampled audio signal.
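A minimal sketch of this compensation step, assuming NumPy and rfft-sized spectra: each bin of the interpolated frame's power spectrum is scaled by its predicted gain, and the result is inverted using the phase of the most recently received sampled frame, as described above.

```python
import numpy as np

def compensate_frame(interp_power, gains, last_sampled_phase, n_fft=512):
    comp_power = gains * interp_power          # compensation power spectrum
    magnitude = np.sqrt(comp_power)
    # reuse the phase of the most recently received sampled audio signal
    spectrum = magnitude * np.exp(1j * last_sampled_phase)
    return np.fft.irfft(spectrum, n_fft)       # compensated audio signal (time domain)
```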
For example, as shown in fig. 11, the deep learning network may include an input DENSE (fully connected layer) unit, two GRU (Gated Recurrent Unit) network units, and an output DENSE unit.
The target power spectra corresponding to the N frames of interpolated audio signals are input into the deep learning network through the input DENSE unit, and features are extracted from the N target power spectra through the two GRU network units to obtain N interpolated audio features. The output DENSE unit then predicts and outputs, based on the N interpolated audio features, the power gain value of each frequency point in the target power spectrum corresponding to each of the N frames of interpolated audio signals.
For each frame of interpolated audio signal in the N frames of interpolated audio signals, the power gain value of each frequency point in the target power spectrum corresponding to the frame of interpolated audio signal is multiplied by the power value of the corresponding frequency point in that target power spectrum, to obtain the compensation power spectrum corresponding to the frame of interpolated audio signal. An inverse Fourier transform is then performed on the compensation power spectrum to obtain the compensated audio signal corresponding to the frame. The N frames of target audio signals are obtained based on the compensated audio signals corresponding to the N frames of interpolated audio signals.
In the embodiment of the application, the deep learning network is used to predict the audio compensation values corresponding to the multiple frames of interpolated audio signals, and the obtained audio compensation values are used to perform audio compensation on the corresponding interpolated audio signals, so that the obtained multi-frame target audio signals are as close as possible to the multi-frame original audio signals, keeping good call quality under frame-extraction coding while reducing the transmission bandwidth of the audio signals.
To better explain the embodiment of the present application, the following describes the flow of the audio processing method provided by the embodiment of the present application with reference to a specific implementation scenario. The method is performed interactively by a sending end, a server, and a receiving end, as shown in fig. 12, and includes the following steps:
step S1201, the transmitting end collects continuous multi-frame original audio signals.
In step S1202, the transmitting end extracts a plurality of frames of sampled audio signals from a plurality of frames of original audio signals.
The multi-frame sampled audio signals comprise a first type of sampled audio signal and a second type of sampled audio signal: the first type is extracted from the multiple frames of original audio signals at intervals of N frames, and the second type is extracted from the multiple frames of original audio signals based on the audio features of the original audio signals, where N is greater than 0.
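The two-class extraction of step S1202 can be sketched as follows; the interval interpretation of "every N frames" and the pitch_changed predicate (sketched later, after the pitch-period description) are assumptions.

```python
def extract_sampled_frames(frames, N, pitch_changed):
    sampled = []
    for i, frame in enumerate(frames):
        if i % N == 0:                    # first type: one frame every N frames
            sampled.append((i, frame))
        elif pitch_changed(frames, i):    # second type: feature-change frames
            sampled.append((i, frame))
    return sampled
```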
In step S1203, the sending end performs speech coding and channel coding on the obtained multiple frames of sampled audio signals respectively by using an independent frame coding mode.
In step S1204, the transmitting end transmits the encoded multi-frame sampled audio signal to the server.
In step S1205, the server transmits the encoded multi-frame sampled audio signal to the receiving end.
In step S1206, the receiving end performs channel decoding and speech decoding on the encoded multi-frame sampled audio signal, respectively.
Step S1207, the receiving end performs interpolation processing on the decoded multi-frame sampled audio signal to obtain a corresponding multi-frame interpolated audio signal.
Specifically, for each frame of sampled audio signal, a first real power spectrum corresponding to that frame and a second real power spectrum corresponding to the next frame of sampled audio signal are obtained. The real power value of each frequency point in the first real power spectrum, the real power value of each frequency point in the second real power spectrum, and the distance between the two frames of sampled audio signals are then substituted into formula (1) to obtain the predicted power spectrum corresponding to at least one frame of compensated audio signal between the two frames of sampled audio signals. The real power spectra of the two frames of sampled audio signals and the predicted power spectrum of the at least one frame of compensated audio signal are taken as the target power spectra of the multiple frames of interpolated audio signals obtained by the interpolation processing.
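Formula (1) appears earlier in the patent and is not reproduced in this section; as a stand-in, the sketch below assumes a distance-weighted linear interpolation between the two real power spectra, which matches the inputs the text names (the two real power values and the distance between the frames).

```python
import numpy as np

def predicted_power_spectra(p_first, p_second, gap):
    # p_first, p_second: real power spectra of two consecutive sampled frames
    # gap: number of compensated frames to predict between them
    specs = []
    for k in range(1, gap + 1):
        w = k / (gap + 1)                  # relative distance to the first frame
        specs.append((1 - w) * p_first + w * p_second)
    return specs
```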
And step S1208, the receiving end respectively predicts the power gain value of each frequency point in the target power spectrum corresponding to each multi-frame interpolation audio signal by adopting the trained deep learning network.
Step S1209, the receiving end performs audio compensation on the corresponding interpolated audio signal respectively by using each obtained power gain value, so as to obtain multi-frame target audio signals.
Specifically, for each frame of interpolated audio signal, the power gain value of each frequency point in the corresponding target power spectrum is multiplied by the power value of the corresponding frequency point in that target power spectrum to obtain the compensation power spectrum corresponding to the frame of interpolated audio signal. An inverse Fourier transform is performed on the compensation power spectrum to obtain the compensated audio signal corresponding to the frame of interpolated audio signal. The compensated audio signals corresponding to the multiple frames of interpolated audio signals are taken as the target audio signals.
In the embodiment of the application, the sending end extracts one frame of sampled audio signal every N frames of original audio signals, and additionally extracts, as sampled audio signals, the original audio signals whose audio features satisfy the feature change condition, so that the characteristics of the original audio are retained while frames are extracted at intervals. Compared with a continuous coding and transmission mode, the number of audio signals that are coded and transmitted is greatly reduced, which effectively reduces the transmission bandwidth of the audio signals. Secondly, the receiving end interpolates and compensates the received frame-extracted audio signals, restoring them into complete multi-frame target audio signals, thereby ensuring the call quality under interval frame extraction.
Based on the same technical concept, the embodiment of the present application provides a schematic structural diagram of an audio processing apparatus, as shown in fig. 13, the apparatus 1300 includes:
a receiving module 1301, configured to receive multiple frames of sampled audio signals sent by a sending end, where the multiple frames of sampled audio signals include a first type of sampled audio signal and a second type of sampled audio signal, the first type of sampled audio signal being obtained by the sending end from the multiple frames of original audio signals at intervals of N frames; the second type of sampled audio signal is obtained by the sending end from the multiple frames of original audio signals based on the audio features of the original audio signals, and N is greater than 0;
the interpolation module 1302 is configured to perform interpolation processing on the multiple frames of sampled audio signals to obtain corresponding multiple frames of interpolated audio signals;
and the compensation module 1303 is configured to predict audio compensation values corresponding to the multiple frames of interpolated audio signals, and perform audio compensation on the corresponding interpolated audio signals by using the obtained audio compensation values, so as to obtain multiple frames of target audio signals.
Optionally, the audio feature includes a pitch period, where the pitch period is used to characterize the temporal wavelength of the pitch corresponding to the audio signal;
aiming at the multi-frame original audio signals, the sending end respectively executes the following steps:
acquiring a first pitch period of a frame of original audio signal and a second pitch period of a previous frame of original audio signal of the frame of original audio signal;
and if the difference between the first pitch period and the second pitch period is greater than a preset threshold, taking the frame of original audio signal as a second type of sampled audio signal.
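The pitch-period check above can be sketched as follows; the autocorrelation estimator, the lag bounds, and the threshold of 20 samples are all assumptions, since the embodiment does not fix a particular pitch-estimation method.

```python
import numpy as np

def pitch_period(frame, min_lag=32, max_lag=400):
    # crude autocorrelation pitch estimate (assumed method)
    corr = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    return min_lag + int(np.argmax(corr[min_lag:max_lag]))

def pitch_changed(frames, i, threshold=20):
    if i == 0:
        return False
    first = pitch_period(frames[i])        # first pitch period (current frame)
    second = pitch_period(frames[i - 1])   # second pitch period (previous frame)
    return abs(first - second) > threshold
```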
Optionally, the interpolation module 1302 is specifically configured to:
for the multi-frame sampled audio signal, respectively performing the following steps:
respectively carrying out Fourier transform on a frame of sampled audio signal and a next frame of sampled audio signal of the frame of sampled audio signal to obtain a first real power spectrum corresponding to the frame of sampled audio signal and a second real power spectrum corresponding to the next frame of sampled audio signal;
determining a predicted power spectrum corresponding to at least one frame of compensated audio signals between the one frame of sampled audio signals and the subsequent frame of sampled audio signals based on the first and second true power spectra;
and carrying out inverse Fourier transform on the real power spectrum corresponding to each of the multi-frame sampled audio signals and each obtained predicted power spectrum to obtain multi-frame interpolated audio signals.
Optionally, the interpolation module 1302 is specifically configured to:
and respectively executing the following steps aiming at the real power values of all frequency points in the first real power spectrum:
respectively determining the predicted power value of each frequency point of the at least one frame of compensation audio signal based on the real power value of one frequency point in the first real power spectrum and the real power value of one frequency point in the second real power spectrum;
and obtaining a prediction power spectrum corresponding to each of the at least one frame of compensation audio signal based on the prediction power value of each of the at least one frame of compensation audio signal at each frequency point.
Optionally, the interpolation module 1302 is specifically configured to:
for the at least one frame of compensated audio signal, respectively performing the following steps:
and determining the predicted power value of the frame of the compensated audio signal at the frequency point based on the real power value of the frequency point in the first real power spectrum, the real power value of the frequency point in the second real power spectrum and the distance between the frame of the compensated audio signal and the frame of the sampled audio signal.
Optionally, the audio compensation value is a power gain value;
the compensation module 1303 is specifically configured to:
acquiring target power spectrums corresponding to the multi-frame interpolation audio signals respectively;
and respectively predicting the power gain value of each frequency point in the target power spectrum corresponding to each multi-frame interpolation audio signal by adopting the trained deep learning network.
Optionally, a model training module 1304 is also included;
the model training module 1304 is specifically configured to:
acquiring sample data, wherein the sample data comprises a plurality of frames of original sample audio signals and a plurality of frames of interpolated sample audio signals, each frame of original sample audio signal corresponds to a first sample power spectrum, and each frame of interpolated sample audio signal corresponds to a second sample power spectrum; the power gain marking value of each frequency point in one second sample power spectrum is the ratio of the first sample power value of each frequency point in the corresponding first sample power spectrum to the second sample power value of the corresponding frequency point in the one second sample power spectrum;
and performing at least one iterative training on the deep learning network to be trained based on the sample data, and outputting the trained deep learning network, wherein in each iterative training, a target loss function for parameter adjustment is determined based on the power gain predicted value of each frequency point in the selected at least one second sample power spectrum and the power gain marking value of each frequency point in the at least one second sample power spectrum.
Optionally, the compensation module 1303 is specifically configured to:
respectively executing the following steps for the multi-frame interpolation audio signal:
performing audio compensation on the power value of the corresponding frequency point in the target power spectrum corresponding to the frame of interpolated audio signal by adopting the power gain value of each frequency point in the target power spectrum corresponding to the frame of interpolated audio signal to obtain a compensation power spectrum corresponding to the frame of interpolated audio signal;
performing inverse Fourier transform on the compensation power spectrum corresponding to the frame of interpolation audio signal to obtain a compensation audio signal corresponding to the frame of interpolation audio signal;
and obtaining multi-frame target audio signals based on the compensation audio signals corresponding to the multi-frame interpolation audio signals respectively.
Optionally, the compensation module 1303 is specifically configured to:
and multiplying the power gain value of each frequency point in the target power spectrum corresponding to the frame of interpolation audio signal by the power value of the corresponding frequency point in the target power spectrum corresponding to the frame of interpolation audio signal to obtain the compensation power spectrum corresponding to the frame of interpolation audio signal.
In the embodiment of the application, the sending end extracts one frame of sampled audio signal every N frames of original audio signals, and additionally extracts, as sampled audio signals, the original audio signals whose audio features satisfy the feature change condition, so that the characteristics of the original audio are retained while frames are extracted at intervals. Compared with a continuous coding and transmission mode, the number of audio signals that are coded and transmitted is greatly reduced, which effectively reduces the transmission bandwidth of the audio signals. Secondly, the receiving end interpolates and compensates the received frame-extracted audio signals, restoring them into complete multi-frame target audio signals, thereby ensuring the call quality under interval frame extraction.
Based on the same technical concept, the embodiment of the present application provides a schematic structural diagram of an audio processing apparatus, as shown in fig. 14, the apparatus 1400 includes:
a sampling module 1401, configured to extract multiple frames of sampled audio signals from multiple frames of original audio signals, where the multiple frames of sampled audio signals include a first type of sampled audio signal and a second type of sampled audio signal, and the first type of sampled audio signal is obtained from the multiple frames of original audio signals every N frames; the second type of sampling audio signal is obtained from the multi-frame original audio signal based on the audio characteristics of the original audio signal, and N is greater than 0;
a sending module 1402, configured to send the multiple frames of sampled audio signals to a receiving end, so that the receiving end performs interpolation processing on the multiple frames of sampled audio signals to obtain corresponding multiple frames of interpolated audio signals; and predicting the audio compensation values corresponding to the multi-frame interpolation audio signals respectively, and performing audio compensation on the corresponding interpolation audio signals respectively by adopting the obtained audio compensation values to obtain multi-frame target audio signals.
Optionally, the audio feature includes a pitch period, where the pitch period is used to characterize a time wavelength of a pitch corresponding to the audio signal;
the sampling module 1401 is specifically configured to:
respectively executing the following steps for the multiple frames of original audio signals:
acquiring a first pitch period of a frame of original audio signal and a second pitch period of a previous frame of original audio signal of the frame of original audio signal;
and if the difference value of the first pitch period and the second pitch period is larger than a preset threshold value, taking the frame of original audio signal as a second type of sampled audio signal.
Based on the same technical concept, the embodiment of the present application provides a computer device, as shown in fig. 15, including at least one processor 1501 and a memory 1502 connected to the at least one processor, where a specific connection medium between the processor 1501 and the memory 1502 is not limited in the embodiment of the present application, and the processor 1501 and the memory 1502 are connected through a bus in fig. 15 as an example. The bus may be divided into an address bus, a data bus, a control bus, etc.
In the embodiment of the present application, the memory 1502 stores instructions executable by the at least one processor 1501, and the at least one processor 1501 can execute the steps of the audio processing method by executing the instructions stored in the memory 1502.
The processor 1501 is the control center of the computer device; it may be connected to various parts of the computer device through various interfaces and lines, and performs audio processing by running or executing the instructions stored in the memory 1502 and calling the data stored in the memory 1502. Optionally, the processor 1501 may include one or more processing units, and may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may not be integrated into the processor 1501. In some embodiments, the processor 1501 and the memory 1502 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
The processor 1501 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, configured to implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The memory 1502, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 1502 may include at least one type of storage medium, for example, a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and so on. The memory 1502 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1502 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device, which, when the program is run on the computer device, causes the computer device to perform the steps of the audio processing method described above.
It should be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (15)

1. An audio processing method, comprising:
receiving multi-frame sampled audio signals sent by a sending end, wherein the multi-frame sampled audio signals comprise first type sampled audio signals and second type sampled audio signals, and the first type sampled audio signals are obtained by the sending end from the multi-frame original audio signals at intervals of N frames; the second type of sampled audio signal is obtained from the multi-frame original audio signal by the sending end based on the audio characteristics of the original audio signal, and N is greater than 0;
carrying out interpolation processing on the multi-frame sampling audio signals to obtain corresponding multi-frame interpolation audio signals;
and predicting the audio compensation values corresponding to the multi-frame interpolation audio signals respectively, and performing audio compensation on the corresponding interpolation audio signals respectively by adopting the obtained audio compensation values to obtain multi-frame target audio signals.
2. The method of claim 1, wherein the audio features include a pitch period, the pitch period being used to characterize a temporal wavelength of a corresponding pitch of the audio signal;
the second type of sampled audio signal is obtained from the multiple frames of original audio signals by the sending end based on the audio characteristics of the original audio signals, and includes:
respectively executing the following steps for the multiple frames of original audio signals:
acquiring a first pitch period of a frame of original audio signal and a second pitch period of a previous frame of original audio signal of the frame of original audio signal;
and if the difference value of the first pitch period and the second pitch period is larger than a preset threshold value, taking the frame of original audio signal as a second type of sampled audio signal.
3. The method of claim 1, wherein said interpolating the plurality of frames of sampled audio signals to obtain a corresponding plurality of frames of interpolated audio signals comprises:
for the multi-frame sampled audio signal, respectively performing the following steps:
respectively carrying out Fourier transform on a frame of sampled audio signal and a next frame of sampled audio signal of the frame of sampled audio signal to obtain a first real power spectrum corresponding to the frame of sampled audio signal and a second real power spectrum corresponding to the next frame of sampled audio signal;
determining a predicted power spectrum corresponding to at least one frame of compensated audio signals between the one frame of sampled audio signals and the subsequent frame of sampled audio signals based on the first and second true power spectra;
and carrying out inverse Fourier transform on the real power spectrum corresponding to each of the multi-frame sampled audio signals and each obtained predicted power spectrum to obtain multi-frame interpolated audio signals.
4. The method of claim 3, wherein determining a predicted power spectrum corresponding to at least one frame of compensated audio signals between the one frame of sampled audio signals and the next frame of sampled audio signals based on the first true power spectrum and the second true power spectrum comprises:
and respectively executing the following steps aiming at the real power values of all frequency points in the first real power spectrum:
respectively determining the predicted power value of each frequency point of the at least one frame of compensation audio signal based on the real power value of one frequency point in the first real power spectrum and the real power value of one frequency point in the second real power spectrum;
and obtaining a prediction power spectrum corresponding to each of the at least one frame of compensation audio signal based on the prediction power value of each of the at least one frame of compensation audio signal at each frequency point.
5. The method of claim 4, wherein the determining the predicted power values of the at least one frame of compensated audio signals at the frequency point respectively based on the real power values of the frequency point in the first real power spectrum and the real power values of the frequency point in the second real power spectrum comprises:
for the at least one frame of compensated audio signal, respectively performing the following steps:
and determining the predicted power value of the frame of the compensated audio signal at the frequency point based on the real power value of the frequency point in the first real power spectrum, the real power value of the frequency point in the second real power spectrum and the distance between the frame of the compensated audio signal and the frame of the sampled audio signal.
6. The method of any of claims 1 to 5, wherein the audio compensation value is a power gain value;
the predicting the audio compensation values corresponding to the multiple frames of interpolated audio signals respectively comprises:
acquiring target power spectrums corresponding to the multi-frame interpolation audio signals respectively;
and respectively predicting the power gain value of each frequency point in the target power spectrum corresponding to each multi-frame interpolation audio signal by adopting the trained deep learning network.
7. The method of claim 6, wherein the trained deep learning network is obtained by training in the following manner:
acquiring sample data, wherein the sample data comprises a plurality of frames of original sample audio signals and a plurality of frames of interpolated sample audio signals, each frame of original sample audio signal corresponds to a first sample power spectrum, and each frame of interpolated sample audio signal corresponds to a second sample power spectrum; the power gain marking value of each frequency point in one second sample power spectrum is the ratio of the first sample power value of each frequency point in the corresponding first sample power spectrum to the second sample power value of the corresponding frequency point in the one second sample power spectrum;
and performing at least one iterative training on the deep learning network to be trained based on the sample data, and outputting the trained deep learning network, wherein in each iterative training, a target loss function for parameter adjustment is determined based on the power gain predicted value of each frequency point in the selected at least one second sample power spectrum and the power gain marking value of each frequency point in the at least one second sample power spectrum.
8. The method as claimed in claim 6, wherein the audio compensation is performed on the corresponding interpolated audio signal by using the obtained audio compensation values to obtain multi-frame target audio signals, respectively, and comprises:
respectively executing the following steps for the multi-frame interpolation audio signal:
performing audio compensation on the power value of the corresponding frequency point in the target power spectrum corresponding to the frame of interpolated audio signal by adopting the power gain value of each frequency point in the target power spectrum corresponding to the frame of interpolated audio signal to obtain a compensation power spectrum corresponding to the frame of interpolated audio signal;
performing inverse Fourier transform on the compensation power spectrum corresponding to the frame of interpolation audio signal to obtain a compensation audio signal corresponding to the frame of interpolation audio signal;
and obtaining multi-frame target audio signals based on the compensation audio signals corresponding to the multi-frame interpolation audio signals respectively.
9. The method according to claim 8, wherein the performing audio compensation on the power value of the corresponding frequency point in the target power spectrum corresponding to the frame of interpolated audio signal by using the power gain value of each frequency point in the target power spectrum corresponding to the frame of interpolated audio signal to obtain the compensated power spectrum corresponding to the frame of interpolated audio signal comprises:
and multiplying the power gain value of each frequency point in the target power spectrum corresponding to the frame of interpolation audio signal by the power value of the corresponding frequency point in the target power spectrum corresponding to the frame of interpolation audio signal to obtain the compensation power spectrum corresponding to the frame of interpolation audio signal.
10. An audio processing method, comprising:
extracting multi-frame sampling audio signals from multi-frame original audio signals, wherein the multi-frame sampling audio signals comprise first type sampling audio signals and second type sampling audio signals, and the first type sampling audio signals are obtained from the multi-frame original audio signals at intervals of N frames; the second type of sampling audio signal is obtained from the multi-frame original audio signal based on the audio characteristics of the original audio signal, and N is greater than 0;
sending the multi-frame sampled audio signals to a receiving end so that the receiving end performs interpolation processing on the multi-frame sampled audio signals to obtain corresponding multi-frame interpolated audio signals; and predicting the audio compensation values corresponding to the multi-frame interpolation audio signals respectively, and performing audio compensation on the corresponding interpolation audio signals respectively by adopting the obtained audio compensation values to obtain multi-frame target audio signals.
11. The method of claim 10, wherein the audio feature comprises a pitch period, the pitch period being used to characterize a temporal wavelength of a corresponding pitch of the audio signal;
the second type of sampled audio signal is obtained from the original audio signals of the plurality of frames based on the audio characteristics of the original audio signals, and includes:
respectively executing the following steps for the multiple frames of original audio signals:
acquiring a first pitch period of a frame of original audio signal and a second pitch period of a previous frame of original audio signal of the frame of original audio signal;
and if the difference value of the first pitch period and the second pitch period is larger than a preset threshold value, taking the frame of original audio signal as a second type of sampled audio signal.
12. An audio processing apparatus, comprising:
the receiving module is used for receiving multi-frame sampled audio signals sent by a sending end, wherein the multi-frame sampled audio signals comprise first-class sampled audio signals and second-class sampled audio signals, and the first-class sampled audio signals are obtained by the sending end from the multi-frame original audio signals at intervals of N frames; the second type of sampled audio signal is obtained from the multi-frame original audio signal by the sending end based on the audio characteristics of the original audio signal, and N is greater than 0;
the interpolation module is used for carrying out interpolation processing on the multi-frame sampling audio signals to obtain corresponding multi-frame interpolation audio signals;
and the compensation module is used for predicting the audio compensation values corresponding to the multi-frame interpolation audio signals respectively, and performing audio compensation on the corresponding interpolation audio signals by adopting the obtained audio compensation values respectively to obtain multi-frame target audio signals.
13. An audio processing apparatus, comprising:
the sampling module is used for extracting multi-frame sampling audio signals from multi-frame original audio signals, wherein the multi-frame sampling audio signals comprise first type sampling audio signals and second type sampling audio signals, and the first type sampling audio signals are obtained from the multi-frame original audio signals at intervals of N frames; the second type of sampling audio signal is obtained from the multi-frame original audio signal based on the audio characteristics of the original audio signal, and N is greater than 0;
the transmitting module is used for transmitting the multi-frame sampled audio signals to a receiving end so that the receiving end carries out interpolation processing on the multi-frame sampled audio signals to obtain corresponding multi-frame interpolated audio signals; and predicting the audio compensation values corresponding to the multi-frame interpolation audio signals respectively, and performing audio compensation on the corresponding interpolation audio signals respectively by adopting the obtained audio compensation values to obtain multi-frame target audio signals.
14. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1 to 11 are performed when the program is executed by the processor.
15. A computer-readable storage medium, storing a computer program executable by a computer device, the program, when executed on the computer device, causing the computer device to perform the steps of the method of any one of claims 1 to 11.
CN202111053462.6A 2021-09-09 2021-09-09 Audio processing method, device, equipment and storage medium Pending CN114283837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111053462.6A CN114283837A (en) 2021-09-09 2021-09-09 Audio processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114283837A true CN114283837A (en) 2022-04-05

Family

ID=80868519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111053462.6A Pending CN114283837A (en) 2021-09-09 2021-09-09 Audio processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114283837A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727308A (en) * 2024-02-18 2024-03-19 百鸟数据科技(北京)有限责任公司 Mixed bird song recognition method based on deep migration learning
CN117727308B (en) * 2024-02-18 2024-04-26 百鸟数据科技(北京)有限责任公司 Mixed bird song recognition method based on deep migration learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination